Journal of Artificial Intelligence Research 12 (2000) 149-198 



Submitted 11/99; published 3/00



A Model of Inductive Bias Learning 

Jonathan Baxter Jonathan.Baxter@anu.edu.au 
Research School of Information Sciences and Engineering 
Australian National University, Canberra 0200, Australia 

Abstract 

A major problem in machine learning is that of inductive bias: how to choose a learner's hypothesis
space so that it is large enough to contain a solution to the problem being learnt, yet small
enough to ensure reliable generalization from reasonably-sized training sets. Typically such bias is
supplied by hand through the skill and insights of experts. In this paper a model for automatically
learning bias is investigated. The central assumption of the model is that the learner is embedded
within an environment of related learning tasks. Within such an environment the learner can sample
from multiple tasks, and hence it can search for a hypothesis space that contains good solutions to
many of the problems in the environment. Under certain restrictions on the set of all hypothesis
spaces available to the learner, we show that a hypothesis space that performs well on a sufficiently
large number of training tasks will also perform well when learning novel tasks in the same
environment. Explicit bounds are also derived demonstrating that learning multiple tasks within an
environment of related tasks can potentially give much better generalization than learning a single
task.

1. Introduction 

Often the hardest problem in any machine learning task is the initial choice of hypothesis space;
it has to be large enough to contain a solution to the problem at hand, yet small enough to ensure 
good generalization from a small number of examples (Mitchell, 1991). Once a suitable bias has 
been found, the actual learning task is often straightforward. Existing methods of bias generally 
require the input of a human expert in the form of heuristics and domain knowledge (for example, 
through the selection of an appropriate set of features). Despite their successes, such methods are 
clearly limited by the accuracy and reliability of the expert's knowledge and also by the extent to 
which that knowledge can be transferred to the learner. Thus it is natural to search for methods for
automatically learning the bias. 

In this paper we introduce and analyze a formal model of bias learning that builds upon
the PAC model of machine learning and its variants (Vapnik, 1982; Valiant, 1984; Blumer,
Ehrenfeucht, Haussler, & Warmuth, 1989; Haussler, 1992). These models typically take the
following general form: the learner is supplied with a hypothesis space $\mathcal{H}$ and training data
$z = \{(x_1, y_1), \dots, (x_m, y_m)\}$ drawn independently according to some underlying distribution $P$
on $X \times Y$. Based on the information contained in $z$, the learner's goal is to select a hypothesis
$h\colon X \to Y$ from $\mathcal{H}$ minimizing some measure $\mathrm{er}_P(h)$ of expected loss with respect to $P$ (for
example, in the case of squared loss $\mathrm{er}_P(h) := \mathbb{E}_{(x,y) \sim P} (h(x) - y)^2$). In such models the learner's
bias is represented by the choice of $\mathcal{H}$; if $\mathcal{H}$ does not contain a good solution to the problem, then,
regardless of how much data the learner receives, it cannot learn.

Of course, the best way to bias the learner is to supply it with an $\mathcal{H}$ containing just a single
optimal hypothesis. But finding such a hypothesis is precisely the original learning problem, so in the
PAC model there is no distinction between bias learning and ordinary learning. Or put differently,
the PAC model does not model the process of inductive bias, it simply takes the hypothesis space $\mathcal{H}$
as given and proceeds from there. To overcome this problem, in this paper we assume that instead
of being faced with just a single learning task, the learner is embedded within an environment of
related learning tasks. The learner is supplied with a family of hypothesis spaces $\mathbb{H} = \{\mathcal{H}\}$, and its
goal is to find a bias (i.e. hypothesis space $\mathcal{H} \in \mathbb{H}$) that is appropriate for the entire environment.
A simple example is the problem of handwritten character recognition. A preprocessing stage that
identifies and removes any (small) rotations, dilations and translations of an image of a character
will be advantageous for recognizing all characters. If the set of all individual character recognition
problems is viewed as an environment of learning problems (that is, the set of all problems of the
form "distinguish 'A' from all other characters", "distinguish 'B' from all other characters", and
so on), this preprocessor represents a bias that is appropriate for all problems in the environment.
It is likely that there are many other currently unknown biases that are also appropriate for this
environment. We would like to be able to learn these automatically.

There are many other examples of learning problems that can be viewed as belonging to envi- 
ronments of related problems. For example, each individual face recognition problem belongs to an 
(essentially infinite) set of related learning problems (all the other individual face recognition prob- 
lems); the set of all individual spoken word recognition problems forms another large environment, 
as does the set of all fingerprint recognition problems, printed Chinese and Japanese character recog- 
nition problems, stock price prediction problems and so on. Even medical diagnostic and prognostic 
problems, where a multitude of diseases are predicted from the same pathology tests, constitute an 
environment of related learning problems. 

In many cases these "environments" are not normally modeled as such; instead they are treated 
as single, multiple category learning problems. For example, recognizing a group of faces would 
normally be viewed as a single learning problem with multiple class labels (one for each face in 
the group), not as multiple individual learning problems. However, if a reliable classifier for each 
individual face in the group can be constructed then they can easily be combined to produce a 
classifier for the whole group. Furthermore, by viewing the faces as an environment of related 
learning problems, the results presented here show that bias can be learnt that will be good for 
learning novel faces, a claim that cannot be made for the traditional approach. 

This point goes to the heart of our model: we are not concerned with adjusting a learner's
bias so it performs better on some fixed set of learning problems. Such a process is in fact just 
ordinary learning but with a richer hypothesis space in which some components labelled "bias" are 
also able to be varied. Instead, we suppose the learner is faced with a (potentially infinite) stream of 
tasks, and that by adjusting its bias on some subset of the tasks it improves its learning performance 
on future, as yet unseen tasks. 

Bias that is appropriate for all problems in an environment must be learnt by sampling from 
many tasks. If only a single task is learnt then the bias extracted is likely to be specific to that 
task. In the rest of this paper, a general theory of bias learning is developed based upon the idea of 
learning multiple related tasks. Loosely speaking (formal results are stated in Section 2), there are 
two main conclusions of the theory presented here: 

• Learning multiple related tasks reduces the sampling burden required for good generalization,
at least on a number-of-examples-required-per-task basis.

• Bias that is learnt on sufficiently many training tasks is likely to be good for learning novel
tasks drawn from the same environment.

The second point shows that a form of meta-generalization is possible in bias learning.
Ordinarily, we say a learner generalizes well if, after seeing sufficiently many training examples, it
produces a hypothesis that with high probability will perform well on future examples of the same
task. However, a bias learner generalizes well if, after seeing sufficiently many training tasks it
produces a hypothesis space that with high probability contains good solutions to novel tasks. Another
term that has been used for this process is Learning to Learn (Thrun & Pratt, 1997).

Our main theorems are stated in an agnostic setting (that is, $\mathbb{H}$ does not necessarily contain a
hypothesis space with solutions to all the problems in the environment), but we also give improved
bounds in the realizable case. The sample complexity bounds appearing in these results are stated
in terms of combinatorial parameters related to the complexity of the set of all hypothesis spaces $\mathbb{H}$
available to the bias learner. For Boolean learning problems (pattern classification) these parameters
are the bias learning analogue of the Vapnik-Chervonenkis dimension (Vapnik, 1982; Blumer et al.,
1989).

As an application of the general theory, the problem of learning an appropriate set of neural- 
network features for an environment of related tasks is formulated as a bias learning problem. In 
the case of continuous neural-network features we are able to prove upper bounds on the number 
of training tasks and number of examples of each training task required to ensure a set of features 
that works well for the training tasks will, with high probability, work well on novel tasks drawn 
from the same environment. The upper bound on the number of tasks scales as $O(b)$ where $b$ is
a measure of the complexity of the possible feature sets available to the learner, while the upper
bound on the number of examples of each task scales as $O(a + b/n)$ where $O(a)$ is the number
of examples required to learn a task if the "true" set of features (that is, the correct bias) is already 
known, and n is the number of tasks. Thus, in this case we see that as the number of related tasks 
learnt increases, the number of examples required of each task for good generalization decays to 
the minimum possible. For Boolean neural-network feature maps we are able to show a matching 
lower bound on the number of examples required per task of the same form. 
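
As a rough numerical illustration of this scaling, the sketch below evaluates the per-task quantity $a + b/n$ for a growing number of tasks $n$. The function name and the values of `a` and `b` are hypothetical, and the $O(\cdot)$ bounds hide constant factors, so only the trend (decay towards $a$) is meaningful.

```python
def examples_per_task(a: float, b: float, n: int) -> float:
    """Illustrative per-task sample requirement a + b/n (constants ignored)."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    return a + b / n


# Hypothetical complexity values, chosen only for illustration.
a, b = 10.0, 1000.0
for n in (1, 10, 100, 1000):
    # As n grows, the shared cost b is amortised and the value approaches a.
    print(n, examples_per_task(a, b, n))
```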

1.1 Related Work 

There is a large body of previous algorithmic and experimental work in the machine learning and
statistics literature addressing the problems of inductive bias learning and improving generalization
through multiple task learning. Some of these approaches can be seen as special cases of, or at least
closely aligned with, the model described here, while others are more orthogonal. Without being
completely exhaustive, in this section we present an overview of the main contributions. See Thrun
and Pratt (1997, chapter 1) for a more comprehensive treatment.

• Hierarchical Bayes. The earliest approaches to bias learning come from Hierarchical Bayesian
methods in statistics (Berger, 1985; Good, 1980; Gelman, Carlin, Stern, & Rubin, 1995).
In contrast to the Bayesian methodology, the present paper takes an essentially empirical
process approach to modeling the problem of bias learning. However, a model using a mixture
of hierarchical Bayesian and information-theoretic ideas was presented in Baxter (1997a),
with similar conclusions to those found here. An empirical study showing the utility of the
hierarchical Bayes approach in a domain containing a large number of related tasks was given
in Heskes (1998).

• Early machine learning work. In Rendell, Seshu, and Tcheng (1987) "VBMS" or Variable Bias
Management System was introduced as a mechanism for selecting amongst different learning
algorithms when tackling a new learning problem. "STABB" or Shift To a Better Bias
(Utgoff, 1986) was another early scheme for adjusting bias, but unlike VBMS, STABB was not
primarily focussed on searching for bias applicable to large problem domains. Our use of an
"environment of related tasks" in this paper may also be interpreted as an "environment of
analogous tasks" in the sense that conclusions about one task can be arrived at by analogy
with (sufficiently many of) the other tasks. For an early discussion of analogy in this
context, see Russell (1989, §4.3), in particular the observation that for analogous problems the
sampling burden per task can be reduced.

• Metric-based approaches. The metric used in nearest-neighbour classification, and in vector
quantization to determine the nearest code-book vector, represents a form of inductive bias.
Using the model of the present paper, and under some extra assumptions on the tasks in
the environment (specifically, that their marginal input-space distributions are identical and
they only differ in the conditional probabilities they assign to class labels), it can be shown
that there is an optimal metric or distance measure to use for vector quantization and
one-nearest-neighbour classification (Baxter, 1995a, 1997b; Baxter & Bartlett, 1998). This metric
can be learnt by sampling from a subset of tasks from the environment, and then used as a
distance measure when learning novel tasks drawn from the same environment. Bounds on
the number of tasks and examples of each task required to ensure good performance on novel
tasks were given in Baxter and Bartlett (1998), along with an experiment in which a metric
was successfully trained on examples of a subset of 400 Japanese characters and then used as
a fixed distance measure when learning 2600 as yet unseen characters.

A similar approach is described in Thrun and Mitchell (1995), Thrun (1996), in which a 
neural network's output was trained to match labels on a novel task, while simultaneously 
being forced to match its gradient to derivative information generated from a distance metric 
trained on previous, related tasks. Performance on the novel tasks improved substantially 
with the use of the derivative information. 

Note that there are many other adaptive metric techniques used in machine learning, but these 
all focus exclusively on adjusting the metric for a fixed set of problems rather than learning a 
metric suitable for learning novel, related tasks (bias learning). 

• Feature learning or learning internal representations. As with adaptive metric techniques,
there are many approaches to feature learning that focus on adapting features for a fixed task
rather than learning features to be used in novel tasks. One of the few cases where features
have been learnt on a subset of tasks with the explicit aim of using them on novel tasks was
Intrator and Edelman (1996) in which a low-dimensional representation was learnt for a set
of multiple related image-recognition tasks and then used to successfully learn novel tasks of
the same kind. The experiments reported in Baxter (1995a, chapter 4), Baxter (1995b), and
Baxter and Bartlett (1998) are also of this nature.

• Bias learning in Inductive Logic Programming (ILP). Predicate invention refers to the
process in ILP whereby new predicates thought to be useful for the classification task at hand
are added to the learner's domain knowledge. By using the new predicates as background
domain knowledge when learning novel tasks, predicate invention may be viewed as a form of
inductive bias learning. Preliminary results with this approach on a chess domain are reported
in Khan, Muggleton, and Parson (1998).

• Improving performance on a fixed reference task. "Multi-task learning" (Caruana, 1997)
trains extra neural network outputs to match related tasks in order to improve generalization
performance on a fixed reference task. Although this approach does not explicitly identify the
extra bias generated by the related tasks in a way that can be used to learn novel tasks, it is
an example of exploiting the bias provided by a set of related tasks to improve generalization
performance. Other similar approaches include Suddarth and Kergosien (1990), Suddarth and
Holden (1991), and Abu-Mostafa (1993).

• Bias as computational complexity. In this paper we consider inductive bias from a
sample-complexity perspective: how does the learnt bias decrease the number of examples required of
novel tasks for good generalization? A natural alternative line of enquiry is how the
running time or computational complexity of a learning algorithm may be improved by training on
related tasks. Some early algorithms for neural networks in this vein are contained in Sharkey
and Sharkey (1993) and Pratt (1992).

• Reinforcement Learning. Many control tasks can appropriately be viewed as elements of sets
of related tasks, such as learning to navigate to different goal states, or learning a set of
complex motor control tasks. A number of papers in the reinforcement learning literature
have proposed algorithms for both sharing the information in related tasks to improve average
generalization performance across those tasks (Singh, 1992; Ring, 1995), or learning bias
from a set of tasks to improve performance on future tasks (Sutton, 1992; Thrun & Schwartz,
1995).

1.2 Overview of the Paper 

In Section 2 the bias learning model is formally defined, and the main sample complexity results 
are given showing the utility of learning multiple related tasks and the feasibility of bias learning. 
These results show that the sample complexity is controlled by the size of certain covering numbers 
associated with the set of all hypothesis spaces available to the bias learner, in much the same way 
as the sample complexity in learning Boolean functions is controlled by the Vapnik-Chervonenkis 
dimension (Vapnik, 1982; Blumer et al., 1989). The results of Section 2 are upper bounds on 
the sample complexity required for good generalization when learning multiple tasks and learning 
inductive bias. 

The general results of Section 2 are specialized to the case of feature learning with neural net- 
works in Section 3, where an algorithm for training features by gradient descent is also presented. 
For this special case we are able to show matching lower bounds for the sample complexity of 
multiple task learning. In Section 4 we present some concluding remarks and directions for future 
research. Many of the proofs are quite lengthy and have been moved to the appendices so as not to 
interrupt the flow of the main text. 

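The following tables contain a glossary of the mathematical symbols used in the paper.

Symbol | Description | First Referenced
$X$ | Input Space | 155
$Y$ | Output Space | 155
$P$ | Distribution on $X \times Y$ (learning task) | 155
$l$ | Loss function | 155
$\mathcal{H}$ | Hypothesis Space | 155
$h$ | Hypothesis | 155
$\mathrm{er}_P(h)$ | Error of hypothesis $h$ on distribution $P$ | 156
$z$ | Training set | 156
$\mathcal{A}$ | Learning Algorithm | 156
$\hat{\mathrm{er}}_z(h)$ | Empirical error of $h$ on training set $z$ | 156
$\mathcal{P}$ | Set of all learning tasks $P$ | 157
$Q$ | Distribution over learning tasks | 157
$\mathbb{H}$ | Family of hypothesis spaces | 157
$\mathrm{er}_Q(\mathcal{H})$ | Loss of hypothesis space $\mathcal{H}$ on environment $Q$ | 158
$\mathbf{z}$ | $(n, m)$-sample | 158
$\hat{\mathrm{er}}_{\mathbf{z}}(\mathcal{H})$ | Empirical loss of $\mathcal{H}$ on $\mathbf{z}$ | 158
$\mathcal{A}$ | Bias learning algorithm | 159
$h_l$ | Function induced by $h$ and $l$ | 159
$\mathcal{H}_l$ | Set of $h_l$ | 159
$(h_1, \dots, h_n)_l$ | Average of $h_{1,l}, \dots, h_{n,l}$ | 159
$\mathbf{h}_l$ | Same as $(h_1, \dots, h_n)_l$ | 159
$\mathcal{H}^n_l$ | Set of $(h_1, \dots, h_n)_l$ | 159
$\mathbb{H}^n_l$ | Set of $\mathcal{H}^n_l$ | 159
$\mathcal{H}^*$ | Function on probability distributions | 160
$\mathbb{H}^*$ | Set of $\mathcal{H}^*$ | 160
$d_{\mathbf{P}}$ | Pseudo-metric on $\mathbb{H}^n_l$ | 160
$d_Q$ | Pseudo-metric on $\mathbb{H}^*$ | 160
$\mathcal{N}(\varepsilon, \mathbb{H}^*, d_Q)$ | Covering number of $\mathbb{H}^*$ | 160
$\mathcal{C}(\varepsilon, \mathbb{H}^*)$ | Capacity of $\mathbb{H}^*$ | 160
$\mathcal{N}(\varepsilon, \mathbb{H}^n_l, d_{\mathbf{P}})$ | Covering number of $\mathbb{H}^n_l$ | 160
$\mathcal{C}(\varepsilon, \mathbb{H}^n_l)$ | Capacity of $\mathbb{H}^n_l$ | 160
$\mathbf{h}$ | Sequence of $n$ hypotheses $(h_1, \dots, h_n)$ | 163
$\mathbf{P}$ | Sequence of $n$ distributions $(P_1, \dots, P_n)$ | 163
$\mathrm{er}_{\mathbf{P}}(\mathbf{h})$ | Average loss of $\mathbf{h}$ on $\mathbf{P}$ | 164
$\hat{\mathrm{er}}_{\mathbf{z}}(\mathbf{h})$ | Average loss of $\mathbf{h}$ on $\mathbf{z}$ | 164
$\mathcal{F}$ | Set of feature maps | 166
$\mathcal{G}$ | Output class composed with feature maps $f$ | 166
$\mathcal{G} \circ f$ | Hypothesis space associated with $f$ | 166
$\mathcal{G}_l$ | Loss function class associated with $\mathcal{G}$ | 166
$\mathcal{N}(\varepsilon, \mathcal{G}_l, d_P)$ | Covering number of $\mathcal{G}_l$ | 166
$\mathcal{C}(\varepsilon, \mathcal{G}_l)$ | Capacity of $\mathcal{G}_l$ | 166
$d_{[P, \mathcal{G}_l]}(f, f')$ | Pseudo-metric on feature maps $f$, $f'$ | 166
$\mathcal{N}(\varepsilon, \mathcal{F}, d_{[P, \mathcal{G}_l]})$ | Covering number of $\mathcal{F}$ | 166

Symbol | Description | First Referenced
 | Covering number of $\mathcal{F}$ | 166
 | Capacity of $\mathcal{F}$ | 166
 | Neural network hypothesis space | 167
 | $\mathcal{H}$ restricted to vector $x$ | 172
$\Pi_{\mathcal{H}}(m)$ | Growth function of $\mathcal{H}$ | 172
$\mathrm{VCdim}(\mathcal{H})$ | Vapnik-Chervonenkis dimension of $\mathcal{H}$ | 172
 | H restricted to matrix x | 173
 | H restricted to matrix x | 173
$\Pi_{\mathbb{H}}(n, m)$ | Growth function of $\mathbb{H}$ | 173
$d_{\mathbb{H}}(n)$ | Dimension function of $\mathbb{H}$ | 173
$\overline{d}(\mathbb{H})$ | Upper dimension function of $\mathbb{H}$ | 173
$\underline{d}(\mathbb{H})$ | Lower dimension function of $\mathbb{H}$ | 173
$\mathrm{opt}_{\mathbf{P}}(\mathbb{H}^n)$ | Optimal performance of $\mathbb{H}^n$ on $\mathbf{P}$ | 175
$d_\nu$ | Metric on $\mathbb{R}^+$ | 179
$h_1 \oplus \cdots \oplus h_n$ | Average of $h_1, \dots, h_n$ | 179
$\mathcal{H}_1 \oplus \cdots \oplus \mathcal{H}_n$ | Set of $h_1 \oplus \cdots \oplus h_n$ | 180
 | Permutations on integer pairs | 182
 | Permuted $\mathbf{z}$ | 182
$d_{\mathbf{z}}(\mathbf{h}, \mathbf{h}')$ | Empirical $l_1$ metric on functions $\mathbf{h}$ | 182
$\mathrm{er}_{\mathbf{P}}(\mathcal{H})$ | Optimal average error of $\mathcal{H}$ on $\mathbf{P}$ | 185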

2. The Bias Learning Model 

In this section the bias learning model is formally introduced. To motivate the definitions, we first
describe the main features of ordinary (single-task) supervised learning models.

2.1 Single-Task Learning 

Computational learning theory models of supervised learning usually include the following
ingredients:

• An input space $X$ and an output space $Y$,

• a probability distribution $P$ on $X \times Y$,

• a loss function $l\colon Y \times Y \to \mathbb{R}$, and

• a hypothesis space $\mathcal{H}$ which is a set of hypotheses or functions $h\colon X \to Y$.

As an example, if the problem is to learn to recognize images of Mary's face using a neural network,
then $X$ would be the set of all images (typically represented as a subset of $\mathbb{R}^d$ where each component
is a pixel intensity), $Y$ would be the set $\{0, 1\}$, and the distribution $P$ would be peaked over images
of different faces and the correct class labels. The learner's hypothesis space $\mathcal{H}$ would be a class of
neural networks mapping the input space $\mathbb{R}^d$ to $\{0, 1\}$. The loss in this case would be discrete loss:

$$l(y, y') := \begin{cases} 1 & \text{if } y \neq y', \\ 0 & \text{if } y = y'. \end{cases} \tag{1}$$

Using the loss function allows us to present a unified treatment of both pattern recognition ($Y = \{0, 1\}$,
$l$ as above), and real-valued function learning (e.g. regression) in which $Y = \mathbb{R}$ and usually
$l(y, y') = (y - y')^2$.

The goal of the learner is to select a hypothesis h with minimum expected loss: 

$$\mathrm{er}_P(h) := \int_{X \times Y} l(h(x), y) \, dP(x, y). \tag{2}$$

Of course, the learner does not know $P$ and so it cannot search through $\mathcal{H}$ for an $h$ minimizing
$\mathrm{er}_P(h)$. In practice, the learner samples repeatedly from $X \times Y$ according to the distribution $P$ to
generate a training set

$$z := \{(x_1, y_1), \dots, (x_m, y_m)\}. \tag{3}$$

Based on the information contained in $z$ the learner produces a hypothesis $h \in \mathcal{H}$. Hence, in general
a learner is simply a map $\mathcal{A}$ from the set of all training samples to the hypothesis space $\mathcal{H}$:

$$\mathcal{A}\colon \bigcup_{m > 0} (X \times Y)^m \to \mathcal{H}$$

(stochastic learners can be treated by assuming a distribution-valued $\mathcal{A}$).

Many algorithms seek to minimize the empirical loss of $h$ on $z$, where this is defined by:

$$\hat{\mathrm{er}}_z(h) := \frac{1}{m} \sum_{i=1}^{m} l(h(x_i), y_i). \tag{4}$$

Of course, there are more intelligent things to do with the data than simply minimizing empirical 
error — for example one can add regularisation terms to avoid over-fitting. 
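
The empirical loss (4) is straightforward to compute. The following is a minimal sketch, assuming hypotheses are plain Python callables and a training set is a list of $(x, y)$ pairs; the function names and the toy data are illustrative and do not come from the paper.

```python
from typing import Any, Callable, Sequence, Tuple


def discrete_loss(y_pred: Any, y: Any) -> float:
    """Discrete loss (1): 1 if the prediction differs from the label, else 0."""
    return 0.0 if y_pred == y else 1.0


def squared_loss(y_pred: float, y: float) -> float:
    """Squared loss used for real-valued function learning."""
    return (y_pred - y) ** 2


def empirical_loss(
    h: Callable[[Any], Any],
    z: Sequence[Tuple[Any, Any]],
    loss: Callable[[Any, Any], float],
) -> float:
    """Empirical loss (4): the mean of l(h(x_i), y_i) over the training set z."""
    if not z:
        raise ValueError("training set must be non-empty")
    return sum(loss(h(x), y) for x, y in z) / len(z)


# Hypothetical training set and threshold classifier on the unit interval.
z = [(0.1, 0), (0.4, 0), (0.6, 1), (0.9, 1)]
h = lambda x: 1 if x > 0.5 else 0
print(empirical_loss(h, z, discrete_loss))
```

Passing the loss as an argument mirrors the unified treatment in the text: the same routine covers pattern recognition (discrete loss) and regression (squared loss).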

However the learner chooses its hypothesis $h$, if we have a uniform bound (over all $h \in \mathcal{H}$) on
the probability of large deviation between $\hat{\mathrm{er}}_z(h)$ and $\mathrm{er}_P(h)$, then we can bound the learner's
generalization error $\mathrm{er}_P(h)$ as a function of its empirical loss on the training set $\hat{\mathrm{er}}_z(h)$. Whether such
a bound holds depends upon the "richness" of $\mathcal{H}$. The conditions ensuring convergence between
$\hat{\mathrm{er}}_z(h)$ and $\mathrm{er}_P(h)$ are by now well understood; for Boolean function learning ($Y = \{0, 1\}$, discrete
loss), convergence is controlled by the VC-dimension$^1$ of $\mathcal{H}$:

Theorem 1. Let $P$ be any probability distribution on $X \times \{0, 1\}$ and suppose
$z = \{(x_1, y_1), \dots, (x_m, y_m)\}$ is generated by sampling $m$ times from $X \times \{0, 1\}$ according to $P$. Let
$d := \mathrm{VCdim}(\mathcal{H})$. Then with probability at least $1 - \delta$ (over the choice of the training set $z$), all
$h \in \mathcal{H}$ will satisfy

$$\mathrm{er}_P(h) \le \hat{\mathrm{er}}_z(h) + \left[ \frac{32}{m} \left( d \log \frac{2em}{d} + \log \frac{4}{\delta} \right) \right]^{1/2}. \tag{5}$$



Proofs of this result may be found in Vapnik (1982), Blumer et al. (1989), and will not be 
reproduced here. 
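
To get a feel for the size of the deviation term in (5), it can be evaluated numerically. The sketch below assumes natural logarithms (the theorem statement does not fix the base here) and uses a hypothetical function name; it illustrates the formula only and plays no part in the proofs.

```python
import math


def vc_deviation_bound(m: int, d: int, delta: float) -> float:
    """Deviation term of (5): sqrt((32/m) * (d*log(2*e*m/d) + log(4/delta))).

    Assumes natural logarithms. m is the sample size, d the VC dimension,
    and delta the allowed failure probability.
    """
    if m <= 0 or d <= 0 or not 0.0 < delta < 1.0:
        raise ValueError("require m > 0, d > 0 and 0 < delta < 1")
    complexity = d * math.log(2.0 * math.e * m / d) + math.log(4.0 / delta)
    return math.sqrt(32.0 / m * complexity)


# Hypothetical values: VC dimension 10, confidence 95%.
for m in (10**3, 10**4, 10**5, 10**6):
    print(m, vc_deviation_bound(m, d=10, delta=0.05))
```

For loss values in $[0, 1]$ the bound is informative only once the printed term drops below 1, which for these illustrative values requires a sample size several orders of magnitude larger than $d$.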



1. The VC dimension of a class of Boolean functions $\mathcal{H}$ is the largest integer $d$ such that there exists a subset
$S = \{x_1, \dots, x_d\} \subseteq X$ such that the restriction of $\mathcal{H}$ to $S$ contains all $2^d$ Boolean functions on $S$.

Theorem 1 only provides conditions under which the deviation between $\mathrm{er}_P(h)$ and $\hat{\mathrm{er}}_z(h)$ is
likely to be small; it does not guarantee that the true error $\mathrm{er}_P(h)$ will actually be small. This is
governed by the choice of $\mathcal{H}$. If $\mathcal{H}$ contains a solution with small error and the learner minimizes
error on the training set, then with high probability $\mathrm{er}_P(h)$ will be small. However, a bad choice of
$\mathcal{H}$ will mean there is no hope of achieving small error. Thus, the bias of the learner in this model$^2$
is represented by the choice of hypothesis space $\mathcal{H}$.

2.2 The Bias Learning Model 

The main extra assumption of the bias learning model introduced here is that the learner is embedded
in an environment of related tasks, and can sample from the environment to generate multiple
training sets belonging to multiple different tasks. In the above model of ordinary (single-task)
learning, a learning task is represented by a distribution $P$ on $X \times Y$. So in the bias learning
model, an environment of learning problems is represented by a pair $(\mathcal{P}, Q)$ where $\mathcal{P}$ is the set of
all probability distributions on $X \times Y$ (i.e., $\mathcal{P}$ is the set of all possible learning problems), and $Q$ is a
distribution on $\mathcal{P}$. $Q$ controls which learning problems the learner is likely to see$^3$. For example, if
the learner is in a face recognition environment, $Q$ will be highly peaked over face-recognition-type
problems, whereas if the learner is in a character recognition environment $Q$ will be peaked over
character-recognition-type problems (here, as in the introduction, we view these environments as
sets of individual classification problems, rather than single, multiple class classification problems).

Recall from the last paragraph of the previous section that the learner's bias is represented by its
choice of hypothesis space $\mathcal{H}$. So to enable the learner to learn the bias, we supply it with a family
or set of hypothesis spaces $\mathbb{H} := \{\mathcal{H}\}$.

Putting all this together, formally a learning to learn or bias learning problem consists of: 

• an input space X and an output space Y (both of which are separable metric spaces), 

• a loss function $l\colon Y \times Y \to \mathbb{R}$,

• an environment $(\mathcal{P}, Q)$ where $\mathcal{P}$ is the set of all probability distributions on $X \times Y$ and $Q$ is
a distribution on $\mathcal{P}$,

• a hypothesis space family $\mathbb{H} = \{\mathcal{H}\}$ where each $\mathcal{H} \in \mathbb{H}$ is a set of functions $h\colon X \to Y$.

From now on we will assume the loss function $l$ has range $[0, 1]$, or equivalently, with rescaling,
we assume that $l$ is bounded.



2. The bias is also governed by how the learner uses the hypothesis space. For example, under some circumstances the
learner may choose not to use the full power of $\mathcal{H}$ (a neural network example is early-stopping). For simplicity in
this paper we abstract away from such features of the algorithm $\mathcal{A}$ and assume that it uses the entire hypothesis space
$\mathcal{H}$.

3. $Q$'s domain is a $\sigma$-algebra of subsets of $\mathcal{P}$. A suitable one for our purposes is the Borel $\sigma$-algebra $\mathcal{B}(\mathcal{P})$ generated
by the topology of weak convergence on $\mathcal{P}$. If we assume that $X$ and $Y$ are separable metric spaces, then $\mathcal{P}$ is also
a separable metric space in the Prohorov metric (which metrizes the topology of weak convergence) (Parthasarathy,
1967), so there is no problem with the existence of measures on $\mathcal{B}(\mathcal{P})$. See Appendix D for further discussion,
particularly the proof of part 5 in Lemma 32.

We define the goal of a bias learner to be to find a hypothesis space $\mathcal{H} \in \mathbb{H}$ minimizing the
following loss:

$$\begin{aligned} \mathrm{er}_Q(\mathcal{H}) &:= \int_{\mathcal{P}} \inf_{h \in \mathcal{H}} \mathrm{er}_P(h) \, dQ(P) \\ &= \int_{\mathcal{P}} \inf_{h \in \mathcal{H}} \int_{X \times Y} l(h(x), y) \, dP(x, y) \, dQ(P). \end{aligned} \tag{6}$$

The only way $\mathrm{er}_Q(\mathcal{H})$ can be small is if, with high $Q$-probability, $\mathcal{H}$ contains a good solution $h$ to
any problem $P$ drawn at random according to $Q$. In this sense $\mathrm{er}_Q(\mathcal{H})$ measures how appropriate
the bias embodied by $\mathcal{H}$ is for the environment $(\mathcal{P}, Q)$.

In general the learner will not know $Q$, so it will not be able to find an $\mathcal{H}$ minimizing $\mathrm{er}_Q(\mathcal{H})$
directly. However, the learner can sample from the environment in the following way:

• Sample $n$ times from $\mathcal{P}$ according to $Q$ to yield: $P_1, \dots, P_n$.

• Sample $m$ times from $X \times Y$ according to each $P_i$ to yield:
$z_i = \{(x_{i1}, y_{i1}), \dots, (x_{im}, y_{im})\}$.

• The resulting $n$ training sets, henceforth called an $(n, m)$-sample if they are generated by the
above process, are supplied to the learner. In the sequel, an $(n, m)$-sample will be denoted
by $\mathbf{z}$ and written as a matrix:

$$\mathbf{z} := \begin{matrix} (x_{11}, y_{11}) & \cdots & (x_{1m}, y_{1m}) & = z_1 \\ \vdots & \ddots & \vdots & \vdots \\ (x_{n1}, y_{n1}) & \cdots & (x_{nm}, y_{nm}) & = z_n \end{matrix} \tag{7}$$

An $(n, m)$-sample is simply $n$ training sets $z_1, \dots, z_n$ sampled from $n$ different learning tasks
$P_1, \dots, P_n$, where each task is selected according to the environmental probability distribution $Q$.
The size of each training set is kept the same primarily to facilitate the analysis.
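
The two-stage sampling process can be sketched in code. The example below assumes a toy environment that is not taken from the paper: each task is a threshold classification problem on $[0, 1]$, $Q$ draws the threshold uniformly, and inputs are uniform on $[0, 1]$. All names are hypothetical.

```python
import random
from typing import Callable, List, Tuple

Example = Tuple[float, int]
Task = Callable[[random.Random], Example]  # draws one (x, y) pair from P


def sample_task(rng: random.Random) -> Task:
    """Draw a task P from the toy environment Q: a random threshold on [0, 1]."""
    theta = rng.random()

    def draw(r: random.Random) -> Example:
        x = r.random()
        return x, int(x > theta)

    return draw


def sample_nm(n: int, m: int, seed: int = 0) -> List[List[Example]]:
    """Return an (n, m)-sample: n training sets of m examples each."""
    rng = random.Random(seed)
    tasks = [sample_task(rng) for _ in range(n)]  # P_1, ..., P_n drawn from Q
    return [[task(rng) for _ in range(m)] for task in tasks]  # row i is z_i


z = sample_nm(n=3, m=5)
for z_i in z:
    print(z_i)
```

Each row of the returned list corresponds to one row $z_i$ of the matrix in (7).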

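Based on the information contained in $\mathbf{z}$, the learner must choose a hypothesis space $\mathcal{H} \in \mathbb{H}$.
One way to do this would be for the learner to find an $\mathcal{H}$ minimizing the empirical loss on $\mathbf{z}$, where
this is defined by:

$$\hat{\mathrm{er}}_{\mathbf{z}}(\mathcal{H}) := \frac{1}{n} \sum_{i=1}^{n} \inf_{h \in \mathcal{H}} \hat{\mathrm{er}}_{z_i}(h). \tag{8}$$

Note that $\hat{\mathrm{er}}_{\mathbf{z}}(\mathcal{H})$ is simply the average of the best possible empirical error achievable on each
training set $z_i$, using a function from $\mathcal{H}$. It is a biased estimate of $\mathrm{er}_Q(\mathcal{H})$. An unbiased
estimate of $\mathrm{er}_Q(\mathcal{H})$ would require choosing an $\mathcal{H}$ with minimal average error over the $n$ distributions
$P_1, \dots, P_n$, where this is defined by $\frac{1}{n} \sum_{i=1}^{n} \inf_{h \in \mathcal{H}} \mathrm{er}_{P_i}(h)$.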
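
For finite hypothesis spaces the empirical loss of a hypothesis space, and the learner that minimizes it over a family, can be written down directly. The sketch below is an illustration under simplifying assumptions (finite spaces given as lists of callables, discrete loss, a hypothetical two-member family); the paper's setting allows infinite hypothesis spaces, where the infimum cannot be found by enumeration.

```python
from typing import Callable, Dict, Sequence, Tuple

Example = Tuple[float, int]
Hypothesis = Callable[[float], int]


def empirical_error(h: Hypothesis, z_i: Sequence[Example]) -> float:
    """Fraction of one training set that h misclassifies (discrete loss)."""
    return sum(h(x) != y for x, y in z_i) / len(z_i)


def empirical_loss_of_space(
    space: Sequence[Hypothesis], z: Sequence[Sequence[Example]]
) -> float:
    """Average over tasks of the best empirical error achievable in `space`."""
    return sum(min(empirical_error(h, z_i) for h in space) for z_i in z) / len(z)


def erm_bias_learner(
    family: Dict[str, Sequence[Hypothesis]], z: Sequence[Sequence[Example]]
) -> str:
    """Return the name of the hypothesis space with minimal empirical loss on z."""
    return min(family, key=lambda name: empirical_loss_of_space(family[name], z))


# Hypothetical family of two finite hypothesis spaces over inputs in [0, 1].
family = {
    "thresholds": [lambda x: int(x > 0.5), lambda x: int(x <= 0.5)],
    "constants": [lambda x: 0, lambda x: 1],
}

# A (2, 2)-sample: the two tasks label the same inputs in opposite ways.
z = [
    [(0.2, 0), (0.8, 1)],
    [(0.2, 1), (0.8, 0)],
]
print(erm_bias_learner(family, z))
```

Note that a different hypothesis may be chosen for each task: the "thresholds" space fits both tasks only because the minimum is taken separately on each row of the sample.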

As with ordinary learning, it is likely there are more intelligent things to do with the training data
$\mathbf{z}$ than minimizing (8). Denoting the set of all $(n, m)$-samples by $(X \times Y)^{(n,m)}$, a general "bias
learner" is a map $\mathcal{A}$ that takes $(n, m)$-samples as input and produces hypothesis spaces $\mathcal{H} \in \mathbb{H}$ as
output:

$$\mathcal{A}\colon \bigcup_{\substack{n > 0 \\ m > 0}} (X \times Y)^{(n,m)} \to \mathbb{H} \tag{9}$$

(as stated, $\mathcal{A}$ is a deterministic bias learner; however, it is trivial to extend our results to stochastic
learners).

Note that in this paper we are concerned only with the sample complexity properties of a bias 
learner A; we do not discuss issues of the computability of A. 

Since $\mathcal{A}$ is searching for entire hypothesis spaces $\mathcal{H}$ within a family of such hypothesis spaces
$\mathbb{H}$, there is an extra representational question in our model of bias learning that is not present in
ordinary learning, and that is how the family $\mathbb{H}$ is represented and searched by $\mathcal{A}$. We defer this
discussion until Section 2.5, after the main sample complexity results for this model of bias learning
have been introduced. For the specific case of learning a set of features suitable for an environment
of related learning problems, see Section 3.

Regardless of how the learner chooses its hypothesis space $\mathcal{H}$, if we have a uniform bound (over
all $\mathcal{H} \in \mathbb{H}$) on the probability of large deviation between $\hat{\mathrm{er}}_{\mathbf{z}}(\mathcal{H})$ and $\mathrm{er}_Q(\mathcal{H})$, and we can compute
an upper bound on $\hat{\mathrm{er}}_{\mathbf{z}}(\mathcal{H})$, then we can bound the bias learner's "generalization error" $\mathrm{er}_Q(\mathcal{H})$.
With this view, the question of generalization within our bias learning model becomes: how many
tasks ($n$) and how many examples of each task ($m$) are required to ensure that $\hat{\mathrm{er}}_{\mathbf{z}}(\mathcal{H})$ and $\mathrm{er}_Q(\mathcal{H})$
are close with high probability, uniformly over all $\mathcal{H} \in \mathbb{H}$? Or, informally, how many tasks and how
many examples of each task are required to ensure that a hypothesis space with good solutions to
all the training tasks will contain good solutions to novel tasks drawn from the same environment?

It turns out that this kind of uniform convergence for bias learning is controlled by the "size"
of certain function classes derived from the hypothesis space family $\mathbb{H}$, in much the same way as
the VC-dimension of a hypothesis space $\mathcal{H}$ controls uniform convergence in the case of Boolean
function learning (Theorem 1). These "size" measures and other auxiliary definitions needed to
state the main theorem are introduced in the following subsection.

2.3 Covering Numbers 

Definition 1. For any hypothesis $h : X \to Y$, define $h_l : X \times Y \to [0, 1]$ by

$$h_l(x, y) := l(h(x), y). \tag{10}$$

For any hypothesis space $\mathcal{H}$ in the hypothesis space family $\mathbb{H}$, define

$$\mathcal{H}_l := \{h_l : h \in \mathcal{H}\}. \tag{11}$$

For any sequence of $n$ hypotheses $(h_1, \dots, h_n)$, define $(h_1, \dots, h_n)_l : (X \times Y)^n \to [0, 1]$ by

$$(h_1, \dots, h_n)_l(x_1, y_1, \dots, x_n, y_n) := \frac{1}{n} \sum_{i=1}^{n} l(h_i(x_i), y_i). \tag{12}$$

We will also use $\mathbf{h}_l$ to denote $(h_1, \dots, h_n)_l$. For any $\mathcal{H}$ in the hypothesis space family $\mathbb{H}$, define

$$\mathcal{H}^n_l := \{(h_1, \dots, h_n)_l : h_1, \dots, h_n \in \mathcal{H}\}. \tag{13}$$

Define

$$\mathbb{H}^n_l := \bigcup_{\mathcal{H} \in \mathbb{H}} \mathcal{H}^n_l. \tag{14}$$

In the first part of the definition above, hypotheses $h : X \to Y$ are turned into functions $h_l$
mapping $X \times Y \to [0, 1]$ by composition with the loss function. $\mathcal{H}_l$ is then just the collection of all
such functions where the original hypotheses come from $\mathcal{H}$. $\mathcal{H}_l$ is often called a loss-function class.
In our case we are interested in the average loss across $n$ tasks, where each of the $n$ hypotheses
is chosen from a fixed hypothesis space $\mathcal{H}$. This motivates the definition of $\mathbf{h}_l$ and $\mathcal{H}^n_l$. Finally,
$\mathbb{H}^n_l$ is the collection of all $(h_1, \dots, h_n)_l$, with the restriction that all $h_1, \dots, h_n$ belong to a single
hypothesis space $\mathcal{H} \in \mathbb{H}$.

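Definition 2. For each $\mathcal{H} \in \mathbb{H}$, define $\mathcal{H}^* : \mathcal{P} \to [0, 1]$ by

$$\mathcal{H}^*(P) := \inf_{h \in \mathcal{H}} \mathrm{er}_P(h). \tag{15}$$

For the hypothesis space family $\mathbb{H}$, define

$$\mathbb{H}^* := \{\mathcal{H}^* : \mathcal{H} \in \mathbb{H}\}. \tag{16}$$

It is the "size" of $\mathbb{H}^n_l$ and $\mathbb{H}^*$ that controls how large the $(n, m)$-sample $\mathbf{z}$ must be to ensure
$\hat{\mathrm{er}}_{\mathbf{z}}(\mathcal{H})$ and $\mathrm{er}_Q(\mathcal{H})$ are close uniformly over all $\mathcal{H} \in \mathbb{H}$. Their size will be defined in terms of
certain covering numbers, and for this we need to define how to measure the distance between
elements of $\mathbb{H}^n_l$ and also between elements of $\mathbb{H}^*$.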

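Definition 3. Let $\mathbf{P} = (P_1, \dots, P_n)$ be any sequence of $n$ probability distributions on $X \times Y$. For
any $\mathbf{h}_l, \mathbf{h}'_l \in \mathbb{H}^n_l$, define

$$d_{\mathbf{P}}(\mathbf{h}_l, \mathbf{h}'_l) := \int_{(X \times Y)^n} \left| \mathbf{h}_l(x_1, y_1, \dots, x_n, y_n) - \mathbf{h}'_l(x_1, y_1, \dots, x_n, y_n) \right| \, dP_1(x_1, y_1) \dots dP_n(x_n, y_n). \tag{17}$$

Similarly, for any distribution $Q$ on $\mathcal{P}$ and any $\mathcal{H}^*_1, \mathcal{H}^*_2 \in \mathbb{H}^*$, define

$$d_Q(\mathcal{H}^*_1, \mathcal{H}^*_2) := \int_{\mathcal{P}} \left| \mathcal{H}^*_1(P) - \mathcal{H}^*_2(P) \right| \, dQ(P). \tag{18}$$

It is easily verified that $d_{\mathbf{P}}$ and $d_Q$ are pseudo-metrics$^4$ on $\mathbb{H}^n_l$ and $\mathbb{H}^*$ respectively.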

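Definition 4. An $\varepsilon$-cover of $(\mathbb{H}^*, d_Q)$ is a set $\{\mathcal{H}^*_1, \dots, \mathcal{H}^*_N\}$ such that for all $\mathcal{H}^* \in \mathbb{H}^*$,
$d_Q(\mathcal{H}^*, \mathcal{H}^*_i) \le \varepsilon$ for some $i = 1, \dots, N$. Note that we do not require the $\mathcal{H}^*_i$ to be contained in
$\mathbb{H}^*$, just that they be measurable functions on $\mathcal{P}$. Let $\mathcal{N}(\varepsilon, \mathbb{H}^*, d_Q)$ denote the size of the smallest
such cover. Define the capacity of $\mathbb{H}^*$ by

$$\mathcal{C}(\varepsilon, \mathbb{H}^*) := \sup_{Q} \mathcal{N}(\varepsilon, \mathbb{H}^*, d_Q) \tag{19}$$

where the supremum is over all probability measures on $\mathcal{P}$. $\mathcal{N}(\varepsilon, \mathbb{H}^n_l, d_{\mathbf{P}})$ is defined in a similar
way, using $d_{\mathbf{P}}$ in place of $d_Q$. Define the capacity of $\mathbb{H}^n_l$ by:

$$\mathcal{C}(\varepsilon, \mathbb{H}^n_l) := \sup_{\mathbf{P}} \mathcal{N}(\varepsilon, \mathbb{H}^n_l, d_{\mathbf{P}}) \tag{20}$$

where now the supremum is over all sequences of $n$ probability measures on $X \times Y$.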
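
As an illustration of what a covering number measures, the sketch below builds an $\varepsilon$-cover greedily for a small, finite class of loss functions. It makes two simplifying assumptions that are not part of Definition 4: the distribution is replaced by the empirical measure on a handful of sample points, so the pseudo-metric becomes a mean absolute difference, and the cover elements are drawn from the class itself. The size of the greedy cover is therefore only an upper bound on the covering number for that one measure, and it says nothing about the supremum in the capacity. The names `l1_distance` and `greedy_cover` are hypothetical.

```python
from typing import List, Sequence


def l1_distance(f: Sequence[float], g: Sequence[float]) -> float:
    """Empirical L1 pseudo-metric: mean |f - g| over the sample points."""
    return sum(abs(a - b) for a, b in zip(f, g)) / len(f)


def greedy_cover(functions: List[Sequence[float]], eps: float) -> List[Sequence[float]]:
    """Pick centres one by one until every function is within eps of some centre."""
    centres: List[Sequence[float]] = []
    for f in functions:
        # f becomes a new centre only if no existing centre already covers it.
        if all(l1_distance(f, c) > eps for c in centres):
            centres.append(f)
    return centres


# Hypothetical loss-function class, each function tabulated on 4 sample points.
loss_class = [
    (0.0, 0.0, 0.0, 0.0),
    (0.1, 0.0, 0.1, 0.0),
    (1.0, 1.0, 1.0, 1.0),
    (0.9, 1.0, 1.0, 0.9),
]

cover = greedy_cover(loss_class, eps=0.2)
```

The four functions fall into two tight clusters, so two centres suffice at scale 0.2.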
4. A pseudo-metric $d$ is a metric without the condition that $d(x, y) = 0 \Rightarrow x = y$.

2.4 Uniform Convergence for Bias Learners

Now we have enough machinery to state the main theorem. In the theorem the hypothesis space
family is required to be permissible. Permissibility is discussed in detail in Appendix D, but note
that it is a weak measure-theoretic condition satisfied by almost all "real-world" hypothesis space
families. All logarithms are to base $e$.

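Theorem 2. Suppose $X$ and $Y$ are separable metric spaces and let $Q$ be any probability distribution
on $\mathcal{P}$, the set of all distributions on $X \times Y$. Suppose $\mathbf{z}$ is an $(n, m)$-sample generated by
sampling $n$ times from $\mathcal{P}$ according to $Q$ to give $P_1, \dots, P_n$, and then sampling $m$ times from each
$P_i$ to generate $z_i = \{(x_{i1}, y_{i1}), \dots, (x_{im}, y_{im})\}$, $i = 1, \dots, n$. Let $\mathbb{H} = \{\mathcal{H}\}$ be any permissible
hypothesis space family. If the number of tasks $n$ satisfies

$$n \ge \max\left\{ \frac{256}{\varepsilon^2} \log \frac{8\, \mathcal{C}\left(\frac{\varepsilon}{32}, \mathbb{H}^*\right)}{\delta}, \; \frac{64}{\varepsilon^2} \right\}, \tag{21}$$

and the number of examples $m$ of each task satisfies

$$m \ge \max\left\{ \frac{256}{n \varepsilon^2} \log \frac{8\, \mathcal{C}\left(\frac{\varepsilon}{32}, \mathbb{H}^n_l\right)}{\delta}, \; \frac{64}{\varepsilon^2} \right\}, \tag{22}$$

then with probability at least $1 - \delta$ (over the $(n, m)$-sample $\mathbf{z}$), all $\mathcal{H} \in \mathbb{H}$ will satisfy

$$\mathrm{er}_Q(\mathcal{H}) \le \hat{\mathrm{er}}_{\mathbf{z}}(\mathcal{H}) + \varepsilon. \tag{23}$$

Proof. See Appendix A. □
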
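
The two conditions of Theorem 2 can be turned into a small calculator. The sketch below assumes the caller already knows the natural logarithm of each capacity at whatever scale the theorem requires; computing those capacities is the hard part and is not attempted here. The function names and the example numbers are hypothetical, and the constants 256, 8 and 64 are read off the statement of the theorem.

```python
import math


def tasks_required(eps: float, delta: float, log_capacity_star: float) -> int:
    """Smallest integer n at least as large as the right-hand side of (21).

    log_capacity_star: natural log of the capacity of H*, supplied by the caller.
    """
    first = (256.0 / eps ** 2) * (math.log(8.0 / delta) + log_capacity_star)
    second = 64.0 / eps ** 2
    return math.ceil(max(first, second))


def examples_required(eps: float, delta: float, n: int, log_capacity_n: float) -> int:
    """Smallest integer m at least as large as the right-hand side of (22).

    log_capacity_n: natural log of the capacity of the n-task loss class,
    supplied by the caller. In general it depends on n; here it is an input.
    """
    first = (256.0 / (n * eps ** 2)) * (math.log(8.0 / delta) + log_capacity_n)
    second = 64.0 / eps ** 2
    return math.ceil(max(first, second))


# Hypothetical log-capacities, chosen only to exercise the formulas.
n_tasks = tasks_required(eps=0.1, delta=0.05, log_capacity_star=10.0)
m_few = examples_required(eps=0.1, delta=0.05, n=10, log_capacity_n=200.0)
m_many = examples_required(eps=0.1, delta=0.05, n=1000, log_capacity_n=200.0)
```

Holding the log-capacity fixed while increasing `n` is the most favourable case; the per-task requirement then falls until it reaches the floor set by the second term of the maximum, which does not depend on `n`.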
There are several important points to note about Theorem 2:

1. Provided the capacities $\mathcal{C}(\varepsilon, \mathbb{H}^*)$ and $\mathcal{C}(\varepsilon, \mathbb{H}^n_l)$ are finite, the theorem shows that any bias
learner that selects hypothesis spaces from $\mathbb{H}$ can bound its generalisation error $\mathrm{er}_Q(\mathcal{H})$ in
terms of $\hat{\mathrm{er}}_{\mathbf{z}}(\mathcal{H})$ for sufficiently large $(n, m)$-samples $\mathbf{z}$. Most bias learners will not find the
exact value of $\hat{\mathrm{er}}_{\mathbf{z}}(\mathcal{H})$ because it involves finding the smallest error of any hypothesis $h \in \mathcal{H}$
on each of the $n$ training sets in $\mathbf{z}$. But any upper bound on $\hat{\mathrm{er}}_{\mathbf{z}}(\mathcal{H})$ (found, for example,
by gradient descent on some error function) will still give an upper bound on $\mathrm{er}_Q(\mathcal{H})$. See
Section 3.3.1 for a brief discussion on how this can be achieved in a feature learning setting.

2. In order to learn bias (in the sense that $\mathrm{er}_Q(\mathcal{H})$ and $\hat{\mathrm{er}}_{\mathbf{z}}(\mathcal{H})$ are close uniformly over all
$\mathcal{H} \in \mathbb{H}$), both the number of tasks $n$ and the number of examples of each task $m$ must
be sufficiently large. This is intuitively reasonable because the bias learner must see both
sufficiently many tasks to be confident of the nature of the environment, and sufficiently
many examples of each task to be confident of the nature of each task.

3. Once the learner has found an $\mathcal{H} \in \mathbb{H}$ with a small value of $\hat{\mathrm{er}}_{\mathbf{z}}(\mathcal{H})$, it can then use $\mathcal{H}$ to
learn novel tasks $P$ drawn according to $Q$. One then has the following theorem bounding the
sample complexity required for good generalisation when learning with $\mathcal{H}$ (the proof is very
similar to the proof of the bound on $m$ in Theorem 2).

Theorem 3. Let $z = \{(x_1, y_1), \dots, (x_m, y_m)\}$ be a training set generated by sampling from
$X \times Y$ according to some distribution $P$. Let $\mathcal{H}$ be a permissible hypothesis space. For all
$\varepsilon, \delta$ with $0 < \varepsilon, \delta < 1$, if the number of training examples $m$ satisfies

$$m \ge \max\left\{ \frac{64}{\varepsilon^2} \log \frac{4\, \mathcal{C}\left(\frac{\varepsilon}{16}, \mathcal{H}_l\right)}{\delta}, \; \frac{16}{\varepsilon^2} \right\} \tag{24}$$

then with probability at least $1 - \delta$, all $h \in \mathcal{H}$ will satisfy

$$\mathrm{er}_P(h) \le \hat{\mathrm{er}}_z(h) + \varepsilon.$$

The capacity $\mathcal{C}(\varepsilon, \mathcal{H}_l)$ appearing in equation (24) is defined in an analogous fashion to the
capacities in Definition 4 (we just use the pseudo-metric $d_P(h_l, h'_l) := \int_{X \times Y} |h_l(x, y) - h'_l(x, y)| \, dP(x, y)$).
The important thing to note about Theorem 3 is that the number of examples
required for good generalisation when learning novel tasks is proportional to the logarithm
of the capacity of the learnt hypothesis space $\mathcal{H}$. In contrast, if the learner does not
do any bias learning, it will have no reason to select one hypothesis space $\mathcal{H} \in \mathbb{H}$ over any
other and consequently it would have to view as a candidate solution any hypothesis in any
of the hypothesis spaces $\mathcal{H} \in \mathbb{H}$. Thus, its sample complexity will be proportional to the
logarithm of the capacity of $\bigcup_{\mathcal{H} \in \mathbb{H}} \mathcal{H}_l = \mathbb{H}^1_l$, which in general will be considerably larger than the
capacity of any individual $\mathcal{H} \in \mathbb{H}$. So by learning $\mathcal{H}$ the learner has learnt to learn in the environment
$(\mathcal{P}, Q)$ in the sense that it needs far smaller training sets to learn novel tasks.

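4. Having learnt a hypothesis space $\mathcal{H}$ with a small value of $\hat{\mathrm{er}}_{\mathbf{z}}(\mathcal{H})$, Theorem 2 tells us that
with probability at least $1 - \delta$, the expected value of $\inf_{h \in \mathcal{H}} \mathrm{er}_P(h)$ on a novel task $P$ will be
less than $\hat{\mathrm{er}}_{\mathbf{z}}(\mathcal{H}) + \varepsilon$. Of course, this does not rule out really bad performance on some tasks
$P$. However, the probability of generating such "bad" tasks can be bounded. In particular,
note that $\mathrm{er}_Q(\mathcal{H})$ is just the expected value of the function $\mathcal{H}^*$ over $\mathcal{P}$, and so by Markov's
inequality, for $\gamma > 0$,

$$\Pr\left\{ P : \inf_{h \in \mathcal{H}} \mathrm{er}_P(h) \ge \gamma \right\} = \Pr\left\{ P : \mathcal{H}^*(P) \ge \gamma \right\} \le \frac{E_Q \mathcal{H}^*}{\gamma} = \frac{\mathrm{er}_Q(\mathcal{H})}{\gamma} \le \frac{\hat{\mathrm{er}}_{\mathbf{z}}(\mathcal{H}) + \varepsilon}{\gamma} \quad \text{(with probability } 1 - \delta\text{)}.$$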

5. Keeping the accuracy and confidence parameters $\varepsilon, \delta$ fixed, note that the number of examples
required of each task for good generalisation obeys

$$m = O\left( \frac{1}{n} \log \mathcal{C}(\varepsilon, \mathbb{H}^n_l) \right). \tag{25}$$

So provided $\log \mathcal{C}(\varepsilon, \mathbb{H}^n_l)$ increases sublinearly with $n$, the upper bound on the number of
examples required of each task will decrease as the number of tasks increases. This shows
that for suitably constructed hypothesis space families it is possible to share information
between tasks. This is discussed further after Theorem 4 below.

2.5 Choosing the Hypothesis Space Family $\mathbb{H}$

Theorem 2 only provides conditions under which $\hat{\mathrm{er}}_{\mathbf{z}}(\mathcal{H})$ and $\mathrm{er}_Q(\mathcal{H})$ are close; it does not guarantee
that $\mathrm{er}_Q(\mathcal{H})$ is actually small. This is governed by the choice of $\mathbb{H}$. If $\mathbb{H}$ contains a hypothesis
space $\mathcal{H}$ with a small value of $\mathrm{er}_Q(\mathcal{H})$ and the learner is able to find an $\mathcal{H} \in \mathbb{H}$ minimizing error on
the $(n, m)$-sample $\mathbf{z}$ (i.e., minimizing $\hat{\mathrm{er}}_{\mathbf{z}}(\mathcal{H})$), then, for sufficiently large $n$ and $m$, Theorem 2 ensures
that with high probability $\mathrm{er}_Q(\mathcal{H})$ will be small. However, a bad choice of $\mathbb{H}$ will mean there
is no hope of finding an $\mathcal{H}$ with small error. In this sense the choice of $\mathbb{H}$ represents the hyper-bias
of the learner.

Note that from a sample complexity point of view, the optimal hypothesis space family to choose
is one containing a single, minimal hypothesis space $\mathcal{H}$ that contains good solutions to all of the
problems in the environment (or at least a set of problems with high $Q$-probability), and no more.
For then there is no bias learning to do (because there is no choice to be made between hypothesis
spaces), the output of the bias learning algorithm is guaranteed to be a good hypothesis space for
the environment, and since the hypothesis space is minimal, learning any problem within the environment
using $\mathcal{H}$ will require the smallest possible number of examples. However, this scenario
is analogous to the trivial scenario in ordinary learning in which the learning algorithm contains a
single, optimal hypothesis for the problem being learnt. In that case there is no learning to be done,
just as there is no bias learning to be done if the correct hypothesis space is already known.

At the other extreme, if $\mathbb{H}$ contains a single hypothesis space $\mathcal{H}$ consisting of all possible functions
from $X \to Y$ then bias learning is impossible because the bias learner cannot produce a
restricted hypothesis space as output, and hence cannot produce a hypothesis space with improved
sample complexity requirements on as yet unseen tasks.

Focussing on these two extremes highlights the minimal requirements on $\mathbb{H}$ for successful bias
learning to occur: the hypothesis spaces $\mathcal{H} \in \mathbb{H}$ must be strictly smaller than the space of all
functions $X \to Y$, but not so small or so "skewed" that none of them contain good solutions to a
large majority of the problems in the environment.

It may seem that we have simply replaced the problem of selecting the right bias (i.e., selecting 
the right hypothesis space H) with the equally difficult problem of selecting the right hyper-bias (i.e., 
the right hypothesis space family H). However, in many cases selecting the right hyper-bias is far 
easier than selecting the right bias. For example, in Section 3 we will see how the feature selection 
problem may be viewed as a bias selection problem. Selecting the right features can be extremely 
difficult if one knows little about the environment, with intelligent trial-and-error typically the best 
one can do. However, in a bias learning scenario, one only has to specify that a set of features should 
exist, find a loosely parameterised set of features (for example neural networks), and then learn the 
features by sampling from multiple related tasks.

2.6 Learning Multiple Tasks 

It may be that the learner is not interested in learning to learn, but just wants to learn a fixed set
of $n$ tasks from the environment $(\mathcal{P}, Q)$. As in the previous section, we assume the learner starts
out with a hypothesis space family $\mathbb{H}$, and also that it receives an $(n, m)$-sample $\mathbf{z}$ generated from
the $n$ distributions $P_1, \dots, P_n$. This time, however, the learner is simply looking for $n$ hypotheses
$(h_1, \dots, h_n)$, all contained in the same hypothesis space $\mathcal{H}$, such that the average generalization
error of the $n$ hypotheses is minimal. Denoting $(h_1, \dots, h_n)$ by $\mathbf{h}$ and writing $\mathbf{P} = (P_1, \dots, P_n)$,
this error is given by:

$$\mathrm{er}_{\mathbf{P}}(\mathbf{h}) := \frac{1}{n} \sum_{i=1}^{n} \mathrm{er}_{P_i}(h_i) = \frac{1}{n} \sum_{i=1}^{n} \int_{X \times Y} l(h_i(x), y) \, dP_i(x, y), \tag{26}$$

and the empirical loss of $\mathbf{h}$ on $\mathbf{z}$ is

$$\hat{\mathrm{er}}_{\mathbf{z}}(\mathbf{h}) := \frac{1}{n} \sum_{i=1}^{n} \hat{\mathrm{er}}_{z_i}(h_i) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{m} \sum_{j=1}^{m} l(h_i(x_{ij}), y_{ij}). \tag{27}$$

As before, regardless of how the learner chooses $(h_1, \dots, h_n)$, if we can prove a uniform bound on
the probability of large deviation between $\hat{\mathrm{er}}_{\mathbf{z}}(\mathbf{h})$ and $\mathrm{er}_{\mathbf{P}}(\mathbf{h})$ then any $(h_1, \dots, h_n)$ that perform
well on the training sets $\mathbf{z}$ will with high probability perform well on future examples of the same
tasks.
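
The empirical loss in (27) is a plain double average, first over the examples of each task and then over tasks. A minimal sketch follows; the function name `multitask_empirical_loss` and the toy hypotheses and data are hypothetical, and the squared loss is used only as an example of a loss with values in $[0, 1]$ on this data.

```python
from typing import Callable, Sequence, Tuple

Example = Tuple[float, float]   # one (x, y) pair


def multitask_empirical_loss(
    hs: Sequence[Callable[[float], float]],
    z: Sequence[Sequence[Example]],
    loss: Callable[[float, float], float],
) -> float:
    """Average over tasks of the mean loss of h_i on its own training set z_i."""
    if len(hs) != len(z):
        raise ValueError("need exactly one hypothesis per task")
    total = 0.0
    for h_i, z_i in zip(hs, z):
        total += sum(loss(h_i(x), y) for x, y in z_i) / len(z_i)
    return total / len(z)


def squared_loss(prediction: float, target: float) -> float:
    return (prediction - target) ** 2


# Two tasks (n = 2) with two examples each (m = 2).
hs = [lambda x: 2.0 * x, lambda x: x + 1.0]
z = [
    [(1.0, 2.0), (2.0, 4.0)],   # h_1 fits this task exactly
    [(0.0, 1.0), (1.0, 3.0)],   # h_2 is off by 1 on the second example
]

value = multitask_empirical_loss(hs, z, squared_loss)
```

The first task contributes 0 and the second contributes 0.5, so the average over the two tasks is 0.25.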

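Theorem 4. Let $\mathbf{P} = (P_1, \dots, P_n)$ be $n$ probability distributions on $X \times Y$ and let $\mathbf{z}$ be an $(n, m)$-sample
generated by sampling $m$ times from $X \times Y$ according to each $P_i$. Let $\mathbb{H} = \{\mathcal{H}\}$ be any
permissible hypothesis space family. If the number of examples $m$ of each task satisfies

$$m \ge \max\left\{ \frac{64}{n \varepsilon^2} \log \frac{4\, \mathcal{C}\left(\frac{\varepsilon}{16}, \mathbb{H}^n_l\right)}{\delta}, \; \frac{16}{\varepsilon^2} \right\} \tag{28}$$

then with probability at least $1 - \delta$ (over the choice of $\mathbf{z}$), any $\mathbf{h} \in \mathbb{H}^n$ will satisfy

$$\mathrm{er}_{\mathbf{P}}(\mathbf{h}) \le \hat{\mathrm{er}}_{\mathbf{z}}(\mathbf{h}) + \varepsilon \tag{29}$$

(recall Definition 4 for the meaning of $\mathcal{C}(\varepsilon, \mathbb{H}^n_l)$).

Proof. Omitted (follow the proof of the bound on $m$ in Theorem 2). □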

The bound on $m$ in Theorem 4 is virtually identical to the bound on $m$ in Theorem 2, and note
again that it depends inversely on the number of tasks $n$ (assuming that the first part of the "max"
expression is the dominant one). Whether this helps depends on the rate of growth of $\mathcal{C}(\varepsilon, \mathbb{H}^n_l)$ as
a function of $n$. The following Lemma shows that this growth is always small enough to ensure that
we never do worse by learning multiple tasks (at least in terms of the upper bound on the number of
examples required per task).

Lemma 5. For any hypothesis space family $\mathbb{H}$,

$$\mathcal{C}(\varepsilon, \mathbb{H}^1_l) \le \mathcal{C}(\varepsilon, \mathbb{H}^n_l) \le \mathcal{C}(\varepsilon, \mathbb{H}^1_l)^n. \tag{30}$$

Proof. Let $\mathcal{K}$ denote the set of all functions $(h_1, \dots, h_n)_l$ where each $h_i$ can be a member of any
hypothesis space $\mathcal{H} \in \mathbb{H}$ (recall Definition 1). Then $\mathbb{H}^n_l \subseteq \mathcal{K}$ and so $\mathcal{C}(\varepsilon, \mathbb{H}^n_l) \le \mathcal{C}(\varepsilon, \mathcal{K})$. By
Lemma 29 in Appendix B, $\mathcal{C}(\varepsilon, \mathcal{K}) \le \mathcal{C}(\varepsilon, \mathbb{H}^1_l)^n$ and so the right hand inequality follows.

For the first inequality, let $P$ be any probability measure on $X \times Y$ and let $\mathbf{P}$ be the measure
on $(X \times Y)^n$ obtained by using $P$ on the first copy of $X \times Y$ in the product, and ignoring
all other elements of the product. Let $N$ be an $\varepsilon$-cover for $(\mathbb{H}^n_l, d_{\mathbf{P}})$. Pick any $h_l \in \mathbb{H}^1_l$ and
let $(g_1, \dots, g_n)_l \in N$ be such that $d_{\mathbf{P}}((h, h, \dots, h)_l, (g_1, \dots, g_n)_l) \le \varepsilon$. But by construction,
$d_{\mathbf{P}}((h, h, \dots, h)_l, (g_1, \dots, g_n)_l) = d_P(h_l, (g_1)_l)$, which establishes the first inequality. □

By Lemma 5

$$\log \mathcal{C}(\varepsilon, \mathbb{H}^1_l) \le \log \mathcal{C}(\varepsilon, \mathbb{H}^n_l) \le n \log \mathcal{C}(\varepsilon, \mathbb{H}^1_l). \tag{31}$$

So keeping the accuracy parameters $\varepsilon$ and $\delta$ fixed, and plugging (31) into (28), we see that the upper
bound on the number of examples required of each task never increases with the number of tasks,
and at best decreases as $O(1/n)$. Although only an upper bound, this provides a strong hint that
learning multiple related tasks should be advantageous on a "number of examples required per task"
basis. In Section 3 it will be shown that for feature learning all types of behavior are possible, from
no advantage at all to $O(1/n)$ decrease.
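
The two extremes allowed by Lemma 5 can be compared numerically. The sketch below evaluates the first term of the bound in Theorem 4 twice: once with the log-capacity of the $n$-task class equal to that of the single-task class (the most favourable case), and once with it equal to $n$ times the single-task value (the least favourable case). The single-task log-capacity `log_c1` is a made-up number, the change of scale in $\varepsilon$ inside the capacity is ignored, and the function names are hypothetical.

```python
import math


def per_task_bound(eps: float, delta: float, n: int, log_capacity_n: float) -> float:
    """Right-hand side of the bound on m in Theorem 4 for a given log-capacity."""
    first = (64.0 / (n * eps ** 2)) * (math.log(4.0 / delta) + log_capacity_n)
    return max(first, 16.0 / eps ** 2)


def best_case(eps: float, delta: float, n: int, log_c1: float) -> float:
    """Log-capacity does not grow with n: the lower end of Lemma 5."""
    return per_task_bound(eps, delta, n, log_c1)


def worst_case(eps: float, delta: float, n: int, log_c1: float) -> float:
    """Log-capacity grows linearly with n: the upper end of Lemma 5."""
    return per_task_bound(eps, delta, n, n * log_c1)


eps, delta, log_c1 = 0.1, 0.05, 50.0   # hypothetical values
for n in (1, 10, 100):
    print(n, round(best_case(eps, delta, n, log_c1)),
          round(worst_case(eps, delta, n, log_c1)))
```

In the favourable case the bound shrinks in proportion to $1/n$ until it reaches the floor $16/\varepsilon^2$; in the unfavourable case only the $\log(4/\delta)$ part is shared across tasks, so the bound barely moves but still never increases.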

2.7 Dependence on $\varepsilon$

In Theorems 2, 3 and 4 the bounds on sample complexity all scale as $1/\varepsilon^2$. This behavior can be
improved to $1/\varepsilon$ if the empirical loss is always guaranteed to be zero (i.e., we are in the realizable
case). The same behavior results if we are interested in relative deviation between empirical and
true loss, rather than absolute deviation. Formal theorems along these lines are stated in Appendix
A.3.

3. Feature Learning 

The use of restricted feature sets is nearly ubiquitous as a method of encoding bias in many areas of 
machine learning and statistics, including classification, regression and density estimation. 

In this section we show how the problem of choosing a set of features for an environment of
related tasks can be recast as a bias learning problem. Explicit bounds on $\mathcal{C}(\varepsilon, \mathbb{H}^*)$ and $\mathcal{C}(\varepsilon, \mathbb{H}^n_l)$
are calculated for general feature classes in Section 3.2. These bounds are applied to the problem of
learning a neural network feature set in Section 3.3.

3.1 The Feature Learning Model 

Consider the following quote from Vapnik (1996): 

The classical approach to estimating multidimensional functional dependencies is 
based on the following belief: 

Real-life problems are such that there exists a small number of "strong features," simple 
functions of which (say linear combinations) approximate well the unknown function. 
Therefore, it is necessary to carefully choose a low-dimensional feature space and then 
to use regular statistical techniques to construct an approximation.

In general a set of "strong features" may be viewed as a function $f : X \to V$ mapping the input
space $X$ into some (typically lower) dimensional space $V$. Let $\mathcal{F} = \{f\}$ be a set of such feature
maps (each $f$ may be viewed as a set of features $(f_1, \dots, f_k)$ if $V = \mathbb{R}^k$). It is the $f$ that must be
"carefully chosen" in the above quote. In general, the "simple functions of the features" may be
represented as a class of functions $\mathcal{G}$ mapping $V$ to $Y$. If for each $f \in \mathcal{F}$ we define the hypothesis
space $\mathcal{G} \circ f := \{g \circ f : g \in \mathcal{G}\}$, then we have the hypothesis space family

$$\mathbb{H} := \{\mathcal{G} \circ f : f \in \mathcal{F}\}. \tag{32}$$

Now the problem of "carefully choosing" the right features $f$ is equivalent to the bias learning
problem "find the right hypothesis space $\mathcal{H} \in \mathbb{H}$". Hence, provided the learner is embedded within
an environment of related tasks, and the capacities $\mathcal{C}(\varepsilon, \mathbb{H}^*)$ and $\mathcal{C}(\varepsilon, \mathbb{H}^n_l)$ are finite, Theorem 2 tells
us that the feature set $f$ can be learnt rather than carefully chosen. This represents an important
simplification, as choosing a set of features is often the most difficult part of any machine learning
problem.

In Section 3.2 we give a theorem bounding $\mathcal{C}(\varepsilon, \mathbb{H}^*)$ and $\mathcal{C}(\varepsilon, \mathbb{H}^n_l)$ for general feature classes.
The theorem is specialized to neural network classes in Section 3.3.

Note that we have forced the function class $\mathcal{G}$ to be the same for all feature maps $f$, although
this is not necessary. Indeed variants of the results to follow can be obtained if $\mathcal{G}$ is allowed to vary
with $f$.
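
The construction in (32) is just function composition applied across a set of feature maps. The sketch below spells this out for finite sets; the helper names and the toy feature maps and threshold classifiers are assumptions made for illustration only.

```python
from typing import Callable, List

FeatureMap = Callable[[float], float]     # f: X -> V
Simple = Callable[[float], int]           # g: V -> Y
Hypothesis = Callable[[float], int]       # g o f: X -> Y


def compose(g: Simple, f: FeatureMap) -> Hypothesis:
    """Return the hypothesis x -> g(f(x))."""
    return lambda x: g(f(x))


def hypothesis_space(G: List[Simple], f: FeatureMap) -> List[Hypothesis]:
    """The space G o f = {g o f : g in G} obtained from one feature map f."""
    return [compose(g, f) for g in G]


def hypothesis_family(G: List[Simple], F: List[FeatureMap]) -> List[List[Hypothesis]]:
    """One hypothesis space per feature map in F, as in (32)."""
    return [hypothesis_space(G, f) for f in F]


# Hypothetical feature maps and "simple functions" of the feature.
F = [lambda x: x * x, lambda x: abs(x)]
G = [lambda v: 1 if v >= 1.0 else 0, lambda v: 1 if v >= 4.0 else 0]

family = hypothesis_family(G, F)
```

Choosing a feature map amounts to choosing one inner list of `family`; learning a particular task then only has to pick a `g` within that list.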

3.2 Capacity Bounds for General Feature Classes 

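Notationally it is easier to view the feature maps $f$ as mapping from $X \times Y$ to $V \times Y$ by $(x, y) \mapsto (f(x), y)$,
and also to absorb the loss function $l$ into the definition of $\mathcal{G}$ by viewing each $g \in \mathcal{G}$ as a
map from $V \times Y$ into $[0, 1]$ via $(v, y) \mapsto l(g(v), y)$. Previously this latter function would have been
denoted $g_l$ but in what follows we will drop the subscript $l$ where this does not cause confusion. The
class to which $g_l$ belongs will still be denoted by $\mathcal{G}_l$.

With the above definitions let $\mathcal{G}_l \circ \mathcal{F} := \{g \circ f : g \in \mathcal{G}_l, f \in \mathcal{F}\}$. Define the capacity of $\mathcal{G}_l$ in
the usual way,

$$\mathcal{C}(\varepsilon, \mathcal{G}_l) := \sup_{P} \mathcal{N}(\varepsilon, \mathcal{G}_l, d_P)$$

where the supremum is over all probability measures on $V \times Y$, and $d_P(g, g') := \int_{V \times Y} |g(v, y) - g'(v, y)| \, dP(v, y)$.
To define the capacity of $\mathcal{F}$ we first define a pseudo-metric $d_{[P, \mathcal{G}_l]}$ on $\mathcal{F}$ by
"pulling back" the metric on $\mathbb{R}$ through $\mathcal{G}_l$ as follows:

$$d_{[P, \mathcal{G}_l]}(f, f') := \int_{X \times Y} \sup_{g \in \mathcal{G}_l} \left| g \circ f(x, y) - g \circ f'(x, y) \right| \, dP(x, y). \tag{33}$$

It is easily verified that $d_{[P, \mathcal{G}_l]}$ is a pseudo-metric. Note that for $d_{[P, \mathcal{G}_l]}$ to be well defined the supremum
over $\mathcal{G}_l$ in the integrand must be measurable. This is guaranteed if the hypothesis space family
$\mathbb{H} = \{\mathcal{G}_l \circ f : f \in \mathcal{F}\}$ is permissible (Lemma 32, part 4). Now define $\mathcal{N}(\varepsilon, \mathcal{F}, d_{[P, \mathcal{G}_l]})$ to be the
size of the smallest $\varepsilon$-cover of the pseudo-metric space $(\mathcal{F}, d_{[P, \mathcal{G}_l]})$ and the $\varepsilon$-capacity of $\mathcal{F}$ (with respect to $\mathcal{G}_l$)
as

$$\mathcal{C}_{\mathcal{G}_l}(\varepsilon, \mathcal{F}) := \sup_{P} \mathcal{N}(\varepsilon, \mathcal{F}, d_{[P, \mathcal{G}_l]})$$

where the supremum is over all probability measures on $X \times Y$. Now we can state the main theorem
of this section.

Theorem 6. Let $\mathbb{H}$ be a hypothesis space family as in equation (32). Then for all $\varepsilon, \varepsilon_1, \varepsilon_2 > 0$ with
$\varepsilon = \varepsilon_1 + \varepsilon_2$,

$$\mathcal{C}(\varepsilon, \mathbb{H}^n_l) \le \mathcal{C}(\varepsilon_1, \mathcal{G}_l)^n \, \mathcal{C}_{\mathcal{G}_l}(\varepsilon_2, \mathcal{F}) \tag{34}$$

$$\mathcal{C}(\varepsilon, \mathbb{H}^*) \le \mathcal{C}_{\mathcal{G}_l}(\varepsilon, \mathcal{F}) \tag{35}$$

Proof. See Appendix B. □
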
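
Taking logarithms in the first bound of Theorem 6 shows how the two capacities enter the per-task sample size. The short derivation below is a sketch only: it drops the constants and the $\delta$ term of Theorem 4 and ignores the change of scale in $\varepsilon$.

```latex
\begin{align*}
\log \mathcal{C}(\varepsilon, \mathbb{H}^n_l)
  &\le n \log \mathcal{C}(\varepsilon_1, \mathcal{G}_l)
     + \log \mathcal{C}_{\mathcal{G}_l}(\varepsilon_2, \mathcal{F}), \\
\frac{1}{n} \log \mathcal{C}(\varepsilon, \mathbb{H}^n_l)
  &\le \log \mathcal{C}(\varepsilon_1, \mathcal{G}_l)
     + \frac{1}{n} \log \mathcal{C}_{\mathcal{G}_l}(\varepsilon_2, \mathcal{F}).
\end{align*}
```

The first term on the right is a per-task cost paid for the class $\mathcal{G}_l$ of simple functions; the second is the cost of the feature class $\mathcal{F}$, which is shared across the $n$ tasks and so shrinks as $1/n$.
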
3.3 Learning Neural Network Features 

In general, a set of features may be viewed as a map from the (typically high-dimensional) input
space $\mathbb{R}^d$ to a much smaller dimensional space $\mathbb{R}^k$ ($k \ll d$). In this section we consider approximating
such a feature map by a one-hidden-layer neural network with $d$ input nodes and $k$ output nodes
(Figure 1). We denote the set of all such feature maps by $\{\Phi_w = (\phi_{w,1}, \dots, \phi_{w,k}) : w \in D\}$ where
$D$ is a bounded subset of $\mathbb{R}^W$ ($W$ is the number of weights (parameters) in the first two layers).
This set is the $\mathcal{F}$ of the previous section.

Each feature $\phi_{w,i} : \mathbb{R}^d \to [0, 1]$, $i = 1, \dots, k$ is defined by

$$\phi_{w,i}(x) := \sigma\left( \sum_{j=1}^{l} v_{ij} h_j(x) + v_{i,l+1} \right) \tag{36}$$

where $h_j(x)$ is the output of the $j$th node in the first hidden layer, $(v_{i1}, \dots, v_{i,l+1})$ are the output
node parameters for the $i$th feature and $\sigma$ is a "sigmoid" squashing function $\sigma : \mathbb{R} \to [0, 1]$. Each
first layer hidden node $h_i : \mathbb{R}^d \to \mathbb{R}$, $i = 1, \dots, l$, computes

$$h_i(x) := \sigma\left( \sum_{j=1}^{d} u_{ij} x_j + u_{i,d+1} \right) \tag{37}$$

where $(u_{i1}, \dots, u_{i,d+1})$ are the hidden node's parameters. We assume $\sigma$ is Lipschitz.$^5$ The weight
vector for the entire feature map is thus

$$w = (u_{11}, \dots, u_{1,d+1}, \dots, u_{l1}, \dots, u_{l,d+1}, v_{11}, \dots, v_{1,l+1}, \dots, v_{k1}, \dots, v_{k,l+1})$$

and the total number of feature parameters $W = l(d + 1) + k(l + 1)$.
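
The feature map of (36) and (37) is a two-stage forward pass, which the following sketch writes out with plain lists. The logistic function is used for $\sigma$ as an assumption (the text only requires a Lipschitz squashing function with range $[0, 1]$), and the function names and the all-zero weights are hypothetical.

```python
import math
from typing import List, Sequence


def sigmoid(t: float) -> float:
    """Logistic squashing function; an assumed choice for sigma."""
    return 1.0 / (1.0 + math.exp(-t))


def hidden_layer(u: Sequence[Sequence[float]], x: Sequence[float]) -> List[float]:
    """Equation (37): each row of u holds d weights followed by one bias."""
    return [sigmoid(sum(w * xj for w, xj in zip(row[:-1], x)) + row[-1]) for row in u]


def feature_map(u: Sequence[Sequence[float]],
                v: Sequence[Sequence[float]],
                x: Sequence[float]) -> List[float]:
    """Equation (36): each row of v holds l weights followed by one bias."""
    h = hidden_layer(u, x)
    return [sigmoid(sum(w * hj for w, hj in zip(row[:-1], h)) + row[-1]) for row in v]


def count_feature_weights(d: int, l: int, k: int) -> int:
    """Total number of feature parameters W = l(d + 1) + k(l + 1)."""
    return l * (d + 1) + k * (l + 1)


# d = 3 inputs, l = 2 hidden nodes, k = 1 feature, all weights zero.
u = [[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
v = [[0.0, 0.0, 0.0]]
features = feature_map(u, v, [1.0, 2.0, 3.0])
W = count_feature_weights(d=3, l=2, k=1)
```

With all weights zero every node outputs `sigmoid(0) = 0.5`, and the parameter count agrees with the number of entries stored in `u` and `v`.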

For argument's sake, assume the "simple functions" of the features (the class $\mathcal{G}$ of the previous
section) are squashed affine maps using the same sigmoid function $\sigma$ above (in keeping with the
"neural network" flavor of the features). Thus, each setting of the feature weights $w$ generates a
hypothesis space:

$$\mathcal{H}_w := \left\{ \sigma\left( \sum_{i=1}^{k} \alpha_i \phi_{w,i} + \alpha_{k+1} \right) : (\alpha_1, \dots, \alpha_{k+1}) \in D' \right\}, \tag{38}$$

where $D'$ is a bounded subset of $\mathbb{R}^{k+1}$. The set of all such hypothesis spaces,

$$\mathbb{H} := \{\mathcal{H}_w : w \in D\} \tag{39}$$



5. $\sigma$ is Lipschitz if there exists a constant $K$ such that $|\sigma(x) - \sigma(x')| \le K|x - x'|$ for all $x, x' \in \mathbb{R}$.

[Figure 1 diagram omitted; its labels, from top to bottom, are "Multiple Output Classes", "Feature Map" and "Input".]

Figure 1: Neural network for feature learning. The feature map is implemented by the first two
hidden layers. The $n$ output nodes correspond to the $n$ different tasks in the $(n, m)$-sample
$\mathbf{z}$. Each node in the network computes a squashed linear function of the nodes in
the previous layer.



is a hypothesis space family. The restrictions on the output layer weights $(\alpha_1, \dots, \alpha_{k+1})$ and feature
weights $w$, and the restriction to a Lipschitz squashing function are needed to obtain finite upper
bounds on the covering numbers in Theorem 2.

Finding a good set of features for the environment $(\mathcal{P}, Q)$ is equivalent to finding a good hypothesis
space $\mathcal{H}_w \in \mathbb{H}$, which in turn means finding a good set of feature map parameters $w$.

As in Theorem 2, the correct set of features may be learnt by finding a hypothesis space with
small error on a sufficiently large $(n, m)$-sample $\mathbf{z}$. Specializing to squared loss, in the present
framework the empirical loss of $\mathcal{H}_w$ on $\mathbf{z}$ (equation (8)) is given by

$$\hat{\mathrm{er}}_{\mathbf{z}}(\mathcal{H}_w) = \frac{1}{n} \sum_{i=1}^{n} \inf_{(\alpha_0, \alpha_1, \dots, \alpha_k) \in D'} \frac{1}{m} \sum_{j=1}^{m} \left[ \sigma\left( \sum_{r=1}^{k} \alpha_r \phi_{w,r}(x_{ij}) + \alpha_0 \right) - y_{ij} \right]^2. \tag{40}$$

Since our sigmoid function $\sigma$ only has range $[0, 1]$, we also restrict the outputs $Y$ to this range.
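
For a fixed feature map, the quantity in (40) can be bounded from above by searching over a finite set of output weights for each task. The sketch below does this with a coarse grid search, which is swapped in here for brevity and is not the gradient-based method the paper discusses. It assumes the feature values $\phi_{w,r}(x_{ij})$ have already been computed, uses the logistic function for $\sigma$, and all names and numbers are hypothetical.

```python
import itertools
import math
from typing import List, Sequence


def sigmoid(t: float) -> float:
    """Logistic squashing function; an assumed choice for sigma."""
    return 1.0 / (1.0 + math.exp(-t))


def task_loss(alpha: Sequence[float],
              feats: Sequence[Sequence[float]],
              ys: Sequence[float]) -> float:
    """Mean squared error on one task; alpha[0] is the bias, alpha[1:] the feature weights."""
    total = 0.0
    for phi, y in zip(feats, ys):
        out = sigmoid(alpha[0] + sum(a * p for a, p in zip(alpha[1:], phi)))
        total += (out - y) ** 2
    return total / len(ys)


def er_z_upper_bound(feats_per_task: Sequence[Sequence[Sequence[float]]],
                     ys_per_task: Sequence[Sequence[float]],
                     grid: Sequence[float]) -> float:
    """Average over tasks of the best loss found on the grid (an upper bound on the inf)."""
    k = len(feats_per_task[0][0])
    candidates = list(itertools.product(grid, repeat=k + 1))
    per_task: List[float] = [
        min(task_loss(alpha, feats, ys) for alpha in candidates)
        for feats, ys in zip(feats_per_task, ys_per_task)
    ]
    return sum(per_task) / len(per_task)


# Two tasks sharing one feature (k = 1); the second task reverses the labels.
feats_per_task = [[[0.1], [0.9]], [[0.1], [0.9]]]
ys_per_task = [[0.0, 1.0], [1.0, 0.0]]
bound = er_z_upper_bound(feats_per_task, ys_per_task, grid=[-4.0, -2.0, 0.0, 2.0, 4.0])
```

Because the grid contains the all-zero weight vector, whose loss here is 0.25 on each task, the returned value can be no larger than 0.25. A finer grid or a descent method would tighten the bound.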
3.3.1 Algorithms for Finding a Good Set of Features 

Provided the squashing function $\sigma$ is differentiable, gradient descent (with a small variation on
backpropagation to compute the derivatives) can be used to find feature weights $w$ minimizing (40)
(or at least a local minimum of (40)). The only extra difficulty over and above ordinary gradient
descent is the appearance of "inf" in the definition of $\hat{\mathrm{er}}_{\mathbf{z}}(\mathcal{H}_w)$. The solution is to perform gradient
descent over both the output parameters $(\alpha_0, \dots, \alpha_k)$ for each node and the feature weights $w$. For
more details see Baxter (1995b) and Baxter (1995a, chapter 4), where empirical results supporting
the theoretical results presented here are also given.

3.3.2 Sample Complexity Bounds for Neural-Network Feature Learning

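The size of $\mathbf{z}$ ensuring that the resulting features will be good for learning novel tasks from the same
environment is given by Theorem 2. All we have to do is compute the logarithm of the covering
numbers $\mathcal{C}(\varepsilon, \mathbb{H}^n_l)$ and $\mathcal{C}(\varepsilon, \mathbb{H}^*)$.

Theorem 7. Let $\mathbb{H} = \{\mathcal{H}_w : w \in \mathbb{R}^W\}$ be a hypothesis space family where each $\mathcal{H}_w$ is of the form

$$\mathcal{H}_w := \left\{ \sigma\left( \sum_{i=1}^{k} \alpha_i \phi_{w,i}(\cdot) + \alpha_0 \right) \right\}$$

where $\Phi_w = (\phi_{w,1}, \dots, \phi_{w,k})$ is a neural network with $W$ weights mapping from $\mathbb{R}^d$ to $\mathbb{R}^k$. If the
feature weights $w$ and the output weights $\alpha_0, \alpha_1, \dots, \alpha_k$ are bounded, the squashing function $\sigma$ is
Lipschitz, $l$ is squared loss, and the output space $Y = [0, 1]$ (any bounded subset of $\mathbb{R}$ will do), then
there exist constants $\kappa, \kappa'$ (independent of $\varepsilon$, $W$ and $k$) such that for all $\varepsilon > 0$,

$$\log \mathcal{C}(\varepsilon, \mathbb{H}^n_l) \le 2\left((k + 1)n + W\right) \log \frac{\kappa}{\varepsilon} \tag{41}$$

$$\log \mathcal{C}(\varepsilon, \mathbb{H}^*) \le 2W \log \frac{\kappa'}{\varepsilon} \tag{42}$$

(recall that we have specialized to squared loss here).

Proof. See Appendix B. □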



Noting that our neural network hypothesis space family $\mathbb{H}$ is permissible, plugging (41) and (42)
into Theorem 2 gives the following theorem.

Theorem 8. Let $\mathbb{H} = \{\mathcal{H}_w\}$ be a hypothesis space family where each hypothesis space $\mathcal{H}_w$ is a
set of squashed linear maps composed with a neural network feature map, as above. Suppose the
number of features is $k$, and the total number of feature weights is $W$. Assume all feature weights and
output weights are bounded, and the squashing function $\sigma$ is Lipschitz. Let $\mathbf{z}$ be an $(n, m)$-sample
generated from the environment $(\mathcal{P}, Q)$. If

$$n \ge O\left( \frac{1}{\varepsilon^2} \left[ W \log \frac{1}{\varepsilon} + \log \frac{1}{\delta} \right] \right) \tag{43}$$

and

$$m \ge O\left( \frac{1}{\varepsilon^2} \left[ \left( k + 1 + \frac{W}{n} \right) \log \frac{1}{\varepsilon} + \frac{1}{n} \log \frac{1}{\delta} \right] \right) \tag{44}$$

then with probability at least $1 - \delta$ any $\mathcal{H}_w \in \mathbb{H}$ will satisfy

$$\mathrm{er}_Q(\mathcal{H}_w) \le \hat{\mathrm{er}}_{\mathbf{z}}(\mathcal{H}_w) + \varepsilon. \tag{45}$$

3.3.3 Discussion

1. Keeping the accuracy and confidence parameters $\varepsilon$ and $\delta$ fixed, the upper bound on the number
of examples required of each task behaves like $O(k + W/n)$. If the learner is simply learning
$n$ fixed tasks (rather than learning to learn), then the same upper bound also applies (recall
Theorem 4).

2. Note that if we do away with the feature map altogether then $W = 0$ and the upper bound on
$m$ becomes $O(k)$, independent of $n$ (apart from the less important $\delta$ term). So in terms of the
upper bound, learning $n$ tasks becomes just as hard as learning one task. At the other extreme,
if we fix the output weights then effectively $k = 0$ and the number of examples required of
each task decreases as $O(W/n)$. Thus a range of behavior in the number of examples required
of each task is possible: from no improvement at all to an $O(1/n)$ decrease as the number of
tasks $n$ increases (recall the discussion at the end of Section 2.6).

3. Once the feature map is learnt (which can be achieved using the techniques outlined in Baxter,
1995b; Baxter & Bartlett, 1998; Baxter, 1995a, chapter 4), only the output weights have to be
estimated to learn a novel task. Again keeping the accuracy parameters fixed, this requires no
more than $O(k)$ examples. Thus, as the number of tasks learnt increases, the upper bound on
the number of examples required of each task decays to the minimum possible, $O(k)$.

4. If the "small number of strong features" assumption is correct, then $k$ will be small. However,
typically we will have very little idea of what the features are, so to be confident that the neural
network is capable of implementing a good feature set it will need to be very large, implying
$W \gg k$. $O(k + W/n)$ decreases most rapidly with increasing $n$ when $W \gg k$, so at least in
terms of the upper bound on the number of examples required per task, learning small feature
sets is an ideal application for bias learning. However, the upper bound on the number of
tasks does not fare so well as it scales as $O(W)$.
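
The order-of-magnitude behaviour $k + W/n$ discussed in the points above is easy to tabulate. The following sketch ignores all constants and the dependence on $\varepsilon$ and $\delta$; the function name and the values of `k` and `W` are hypothetical.

```python
def per_task_order(k: int, W: int, n: int) -> float:
    """Order of the per-task sample size, k + W / n, with constants dropped."""
    return k + W / n


k, W = 10, 10_000   # few features, many feature weights (W much larger than k)
table = {n: per_task_order(k, W, n) for n in (1, 10, 100, 1000)}
for n, value in table.items():
    print(f"n = {n:4d}   k + W/n = {value:8.1f}")
```

With a single task the feature weights dominate the count; by a thousand tasks the per-task figure is within a factor of two of $k$, the figure that would apply if the features were already known.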

3.3.4 Comparison with Traditional Multiple-Class Classification

A special case of this multi-task framework is one in which the marginal distribution on the input
space $P_{i|X}$ is the same for each task $i = 1, \dots, n$, and all that varies between tasks is the conditional
distribution over the output space $Y$. An example would be a multi-class problem such as face
recognition, in which $Y = \{1, \dots, n\}$ where $n$ is the number of faces to be recognized and the
marginal distribution on $X$ is simply the "natural" distribution over images of those faces. In that
case, if for every example $x_{ij}$ we have (in addition to the sample $y_{ij}$ from the $i$th task's conditional
distribution on $Y$) samples from the remaining $n - 1$ conditional distributions on $Y$, then we can
view the $n$ training sets containing $m$ examples each as one large training set for the multi-class
problem with $mn$ examples altogether. The bound on $m$ in Theorem 8 states that $mn$ should be
$O(nk + W)$, or proportional to the total number of parameters in the network, a result we would
expect from$^6$ (Haussler, 1992).

So when specialized to the traditional multiple-class, single task framework, Theorem 8 is consistent
with the bounds already known. However, as we have already argued, problems such as face
recognition are not really single-task, multiple-class problems. They are more appropriately viewed
as a (potentially infinite) collection of distinct binary classification problems. In that case, the goal
of bias learning is not to find a single $n$-output network that can classify some subset of $n$ faces
well. It is to learn a set of features that can reliably be used as a fixed preprocessing for distinguishing
any single face from other faces. This is the new thing provided by Theorem 8: it tells us that
provided we have trained our $n$-output neural network on sufficiently many examples of sufficiently
many tasks, we can be confident that the common feature map learnt for those $n$ tasks will be good
for learning any new, as yet unseen task, provided the new task is drawn from the same distribution
that generated the training tasks. In addition, learning the new task only requires estimating the $k$
output node parameters for that task, a vastly easier problem than estimating the parameters of the
entire network, from both a sample and computational complexity perspective. Also, since we have
high confidence that the learnt features will be good for learning novel tasks drawn from the same
environment, those features are themselves a candidate for further study to learn more about the
nature of the environment. The same claim could not be made if the features had been learnt on too
small a set of tasks to guarantee generalization to novel tasks, for then it is likely that the features
would implement idiosyncrasies specific to those tasks, rather than "invariances" that apply across
all tasks.

6. If each example can be classified with a "large margin" then naive parameter counting can be improved upon (Bartlett,
1998).

When viewed from a bias (or feature) learning perspective, rather than a traditional $n$-class
classification perspective, the bound on the number of examples $m$ required of each task takes on
a somewhat different meaning. It tells us that provided $n$ is large (i.e., we are collecting examples
of a large number of tasks), then we really only need to collect a few more examples than we would
otherwise have to collect if the feature map was already known ($k + W/n$ examples vs. $k$ examples).
So it tells us that the burden imposed by feature learning can be made negligibly small, at least when
viewed from the perspective of the sampling burden required of each task.

3.4 Learning Multiple Tasks with Boolean Feature Maps 

Ignoring the accuracy and confidence parameters $\varepsilon$ and $\delta$, Theorem 8 shows that the number of
examples required of each task when learning n tasks with a common neural-network feature map
is bounded above by $O(k + W/n)$, where k is the number of features and W is the number of
adjustable parameters in the feature map. Since $O(k)$ examples are required to learn a single task
once the true features are known, this shows that the upper bound on the number of examples
required of each task decays (in order) to the minimum possible as the number of tasks n increases.
This suggests that learning multiple tasks is advantageous, but to be truly convincing we need to
prove a lower bound of the same form. Proving lower bounds in a real-valued setting ($Y = \mathbb{R}$)
is complicated by the fact that a single example can convey an infinite amount of information, so
one typically has to make extra assumptions, such as that the targets $y \in Y$ are corrupted by a
noise process. Rather than concern ourselves with such complications, in this section we restrict
our attention to Boolean hypothesis space families (meaning each hypothesis $h \in \mathbb{H}^1$ maps to
$Y = \{\pm 1\}$ and we measure error by discrete loss $l(h(x), y) = 1$ if $h(x) \neq y$ and $l(h(x), y) = 0$
otherwise).

We show that the sample complexity for learning n tasks with a Boolean hypothesis space family
$\mathbb{H}$ is controlled by a "VC dimension" type parameter $d_{\mathbb{H}}(n)$ (that is, we give nearly matching upper
and lower bounds involving $d_{\mathbb{H}}(n)$). We then derive bounds on $d_{\mathbb{H}}(n)$ for the hypothesis space
family considered in the previous section with the Lipschitz sigmoid function $\sigma$ replaced by a hard
threshold (linear threshold networks).

As well as the bound on the number of examples required per task for good generalization across
those tasks, Theorem 8 also shows that features performing well on $O(W)$ tasks will generalize well
to novel tasks, where W is the number of parameters in the feature map. Given that for many feature
learning problems W is likely to be quite large (recall Note 4 in Section 3.3.3), it would be useful
to know that $O(W)$ tasks are in fact necessary without further restrictions on the environmental
distributions $Q$ generating the tasks. Unfortunately, we have not yet been able to show such a lower
bound.

There is some empirical evidence suggesting that in practice the upper bound on the number of 
tasks may be very weak. For example, in Baxter and Bartlett (1998) we reported experiments in 
which a set of neural network features learnt on a subset of only 400 Japanese characters turned out 
to be good enough for classifying some 2600 unseen characters, even though the features contained 
several hundred thousand parameters. Similar results may be found in Intrator and Edelman (1996) 
and in the experiments reported in Thrun (1996) and Thrun and Pratt (1997, chapter 8). While 
this gap between experiment and theory may be just another example of the looseness inherent in 
general bounds, it may also be that the analysis can be tightened. In particular, the bound on the 
number of tasks is insensitive to the size of the class of output functions (the class $\mathcal{G}$ in Section 3.1),
which may be where the looseness has arisen. 

3.4.1 Upper and Lower Bounds for Learning n Tasks with Boolean Hypothesis 
Space Families 

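First we recall some concepts from the theory of Boolean function learning. Let $\mathcal{H}$ be a class of
Boolean functions on $X$ and $\mathbf{x} = (x_1, \dots, x_m) \in X^m$. $\mathcal{H}_{|\mathbf{x}}$ is the set of all binary vectors obtainable
by applying functions in $\mathcal{H}$ to $\mathbf{x}$:

$$\mathcal{H}_{|\mathbf{x}} := \{(h(x_1), \dots, h(x_m)) : h \in \mathcal{H}\}.$$

Clearly $|\mathcal{H}_{|\mathbf{x}}| \le 2^m$. If $|\mathcal{H}_{|\mathbf{x}}| = 2^m$ we say $\mathcal{H}$ shatters $\mathbf{x}$. The growth function of $\mathcal{H}$ is defined by

$$\Pi_{\mathcal{H}}(m) := \max_{\mathbf{x} \in X^m} |\mathcal{H}_{|\mathbf{x}}|.$$

The Vapnik-Chervonenkis dimension $\mathrm{VCdim}(\mathcal{H})$ is the size of the largest set shattered by $\mathcal{H}$:

$$\mathrm{VCdim}(\mathcal{H}) := \max\{m : \Pi_{\mathcal{H}}(m) = 2^m\}.$$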



An important result in the theory of learning Boolean functions is Sauer's Lemma (Sauer, 1972), of
which we will also make use.

Lemma 9 (Sauer's Lemma). For a Boolean function class $\mathcal{H}$ with $\mathrm{VCdim}(\mathcal{H}) = d$,

$$\Pi_{\mathcal{H}}(m) \le \sum_{i=0}^{d} \binom{m}{i} \le \left(\frac{em}{d}\right)^{d}$$

for all positive integers m. 
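
The definitions above can be checked by brute force on a small finite domain. The following is a minimal sketch, not part of the original analysis: the toy class (indicator functions of intervals on six points) and the helper names `restrictions`, `growth` and `vc_dim` are assumptions chosen for illustration. The closed-form bound $(em/d)^d$ is printed only for $m \ge d$, the range in which that form of the bound is standard.

```python
from itertools import product
from math import comb, e

def restrictions(hypotheses, xs):
    """Distinct label vectors that the class induces on the points xs."""
    return {tuple(h[x] for x in xs) for h in hypotheses}

def growth(hypotheses, domain_size, m):
    """Growth function: the largest number of labelings over all x in X^m."""
    return max(len(restrictions(hypotheses, xs))
               for xs in product(range(domain_size), repeat=m))

def vc_dim(hypotheses, domain_size):
    """Largest m with growth(m) == 2**m (0 if no single point is shattered)."""
    d = 0
    for m in range(1, domain_size + 1):
        if growth(hypotheses, domain_size, m) != 2 ** m:
            break  # once shattering fails it fails for every larger m
        d = m
    return d

# Toy class (an assumption for illustration): indicators of intervals [a, b)
# on X = {0, ..., 5}, with labels in {-1, +1}. Each function is stored as a
# tuple of its values on X.
N = 6
intervals = {tuple(1 if a <= x < b else -1 for x in range(N))
             for a in range(N + 1) for b in range(a, N + 1)}

d = vc_dim(intervals, N)
for m in range(1, 5):
    pi_m = growth(intervals, N, m)
    sauer_sum = sum(comb(m, i) for i in range(d + 1))
    line = f"m={m}  growth={pi_m}  sum_binom={sauer_sum}"
    if m >= d:
        line += f"  (em/d)^d={(e * m / d) ** d:.2f}"
    print(line)
```

For this particular class the growth function meets the binomial sum in Sauer's Lemma with equality, which makes it a convenient sanity check.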

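We now generalize these concepts to learning n tasks with a Boolean hypothesis space family.

Definition 5. Let $\mathbb{H}$ be a Boolean hypothesis space family. Denote the $n \times m$ matrices over the
input space $X$ by $X^{(n,m)}$. For each $\mathbf{x} \in X^{(n,m)}$ and $\mathcal{H} \in \mathbb{H}$, define $\mathcal{H}_{|\mathbf{x}}$ to be the set of (binary)
matrices,

$$\mathcal{H}_{|\mathbf{x}} := \left\{ \begin{bmatrix} h_1(x_{11}) & \cdots & h_1(x_{1m}) \\ \vdots & \ddots & \vdots \\ h_n(x_{n1}) & \cdots & h_n(x_{nm}) \end{bmatrix} : h_1, \dots, h_n \in \mathcal{H} \right\}.$$

Define

$$\mathbb{H}_{|\mathbf{x}} := \bigcup_{\mathcal{H} \in \mathbb{H}} \mathcal{H}_{|\mathbf{x}}.$$

Now for each $n > 0$, $m > 0$, define $\Pi_{\mathbb{H}}(n, m)$ by

$$\Pi_{\mathbb{H}}(n, m) := \max_{\mathbf{x} \in X^{(n,m)}} |\mathbb{H}_{|\mathbf{x}}|.$$

Note that $\Pi_{\mathbb{H}}(n, m) \le 2^{nm}$. If $|\mathbb{H}_{|\mathbf{x}}| = 2^{nm}$ we say $\mathbb{H}$ shatters the matrix $\mathbf{x}$. For each $n > 0$ let

$$d_{\mathbb{H}}(n) := \max\{m : \Pi_{\mathbb{H}}(n, m) = 2^{nm}\}.$$

Define

$$\overline{d}(\mathbb{H}) := \mathrm{VCdim}(\mathbb{H}^1) \quad \text{and} \quad \underline{d}(\mathbb{H}) := \max_{\mathcal{H} \in \mathbb{H}} \mathrm{VCdim}(\mathcal{H}).$$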



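Lemma 10.

$$\overline{d}(\mathbb{H}) \ge \underline{d}(\mathbb{H}),$$

$$d_{\mathbb{H}}(n) \ge \max\left\{ \left\lfloor \frac{\overline{d}(\mathbb{H})}{n} \right\rfloor, \underline{d}(\mathbb{H}) \right\} \ge \frac{1}{2}\left( \left\lfloor \frac{\overline{d}(\mathbb{H})}{n} \right\rfloor + \underline{d}(\mathbb{H}) \right).$$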



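Proof. The first inequality is trivial from the definitions. To get the second term in the maximum
in the second inequality, choose an $\mathcal{H} \in \mathbb{H}$ with $\mathrm{VCdim}(\mathcal{H}) = \underline{d}(\mathbb{H})$ and construct a matrix
$\mathbf{x} \in X^{(n,m)}$ whose rows are of length $\underline{d}(\mathbb{H})$ and are shattered by $\mathcal{H}$. Then clearly $\mathbb{H}$ shatters $\mathbf{x}$. For
the first term in the maximum take a sequence $x = (x_1, \dots, x_{\overline{d}(\mathbb{H})})$ shattered by $\mathbb{H}^1$ (the hypothesis
space consisting of the union over all hypothesis spaces from $\mathbb{H}$), and distribute its elements equally
among the rows of $\mathbf{x}$ (throw away any leftovers). The set of matrices

$$\left\{ \begin{bmatrix} h(x_{11}) & \cdots & h(x_{1m}) \\ \vdots & \ddots & \vdots \\ h(x_{n1}) & \cdots & h(x_{nm}) \end{bmatrix} : h \in \mathbb{H}^1 \right\}$$

where $m = \lfloor \overline{d}(\mathbb{H})/n \rfloor$ is a subset of $\mathbb{H}_{|\mathbf{x}}$ and has size $2^{nm}$. □

Lemma 11.

$$\Pi_{\mathbb{H}}(n, m) \le \left[ \frac{em}{d_{\mathbb{H}}(n)} \right]^{n d_{\mathbb{H}}(n)}.$$

Proof. Observe that for each n, $\Pi_{\mathbb{H}}(n, m) = \Pi_{\overline{\mathcal{H}}}(nm)$ where $\overline{\mathcal{H}}$ is the collection of all Boolean
functions on sequences $x_1, \dots, x_{nm}$ obtained by first choosing n functions $h_1, \dots, h_n$ from some
$\mathcal{H} \in \mathbb{H}$, and then applying $h_1$ to the first m examples, $h_2$ to the second m examples and so on. By
the definition of $d_{\mathbb{H}}(n)$, $\mathrm{VCdim}(\overline{\mathcal{H}}) = n d_{\mathbb{H}}(n)$, hence the result follows from Lemma 9 applied to
$\overline{\mathcal{H}}$. □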
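
To make the quantity $d_{\mathbb{H}}(n)$ concrete, the following minimal sketch computes it by exhaustive search for a tiny family. The family used here (each hypothesis space consists of a sign pattern $g$ on three points and its negation) and the helper names `shatters` and `d_family` are assumptions made for illustration; they do not come from the paper. The last column printed is the lower bound $\max\{\lfloor \overline{d}(\mathbb{H})/n \rfloor, \underline{d}(\mathbb{H})\}$ of Lemma 10.

```python
from itertools import product

def shatters(family, x):
    """True if the family realises all 2^(n*m) sign matrices on x.

    x is a tuple of n rows, each a tuple of m points. All n row functions
    must be drawn from the same hypothesis space H in the family.
    """
    n, m = len(x), len(x[0])
    seen = set()
    for H in family:
        for hs in product(H, repeat=n):
            seen.add(tuple(tuple(h[p] for p in row)
                           for h, row in zip(hs, x)))
    return len(seen) == 2 ** (n * m)

def d_family(family, domain_size, n):
    """Largest m such that some n-by-m matrix over X is shattered."""
    best = 0
    for m in range(1, domain_size + 1):
        rows = list(product(range(domain_size), repeat=m))
        if any(shatters(family, x) for x in product(rows, repeat=n)):
            best = m
        else:
            break  # a shattered matrix stays shattered when columns are dropped
    return best

# Toy family (an assumption for illustration): on X = {0, 1, 2}, each
# hypothesis space is {g, -g} for a sign pattern g on X.
N = 3
family = [(g, tuple(-v for v in g)) for g in product((-1, 1), repeat=N)]

d_lower = 1  # each space holds two functions, so its VC dimension is 1
d_upper = N  # the union of all spaces is every sign function on X
for n in (1, 2, 3):
    print(n, d_family(family, N, n), max(d_upper // n, d_lower))
```

The search is exponential in $nm$, so it is only feasible for very small domains; its purpose is to illustrate the definition rather than to provide a practical algorithm.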

If one follows the proof of Theorem 4 (in particular the proof of Theorem 18 in Appendix
A) then it is clear that for all $\varepsilon > 0$, $\mathcal{C}(\mathbb{H}^n_l, \varepsilon)$ may be replaced by $\Pi_{\mathbb{H}}(n, 2m)$ in the Boolean
case. Making this replacement in Theorem 18, and using the choices of $\alpha$, $\nu$ from the discussion
following Theorem 26, we obtain the following bound on the probability of large deviation between
empirical and true performance in this Boolean setting.

Theorem 12. Let $\mathbf{P} = (P_1, \dots, P_n)$ be n probability distributions on $X \times \{\pm 1\}$ and let $\mathbf{z}$ be an
$(n, m)$-sample generated by sampling m times from $X \times \{\pm 1\}$ according to each $P_i$. Let $\mathbb{H} = \{\mathcal{H}\}$
be any permissible Boolean hypothesis space family. For all $0 < \varepsilon \le 1$,

$$\Pr\{\mathbf{z} : \exists \mathbf{h} \in \mathbb{H}^n : \mathrm{er}_{\mathbf{P}}(\mathbf{h}) > \hat{\mathrm{er}}_{\mathbf{z}}(\mathbf{h}) + \varepsilon\} \le 4 \Pi_{\mathbb{H}}(n, 2m) \exp(-\varepsilon^2 nm / 64). \tag{46}$$



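Corollary 13. Under the conditions of Theorem 12, if the number of examples m of each task
satisfies

$$m \ge \frac{88}{\varepsilon^2} \left[ 2 d_{\mathbb{H}}(n) \log \frac{22}{\varepsilon} + \frac{1}{n} \log \frac{4}{\delta} \right] \tag{47}$$

then with probability at least $1 - \delta$ (over the choice of $\mathbf{z}$), any $\mathbf{h} \in \mathbb{H}^n$ will satisfy

$$\mathrm{er}_{\mathbf{P}}(\mathbf{h}) \le \hat{\mathrm{er}}_{\mathbf{z}}(\mathbf{h}) + \varepsilon.$$

Proof. Applying Theorem 12, we require

$$4 \Pi_{\mathbb{H}}(n, 2m) \exp(-\varepsilon^2 nm / 64) \le \delta, \tag{48}$$

which is satisfied if

$$m \ge \frac{64}{\varepsilon^2} \left[ d_{\mathbb{H}}(n) \log \frac{2em}{d_{\mathbb{H}}(n)} + \frac{1}{n} \log \frac{4}{\delta} \right], \tag{49}$$

where we have used Lemma 11. Now, for all $a \ge 1$, if

$$m \ge \left(1 + \frac{1}{e}\right) a \log\left( \left(1 + \frac{1}{e}\right) a \right),$$

then $m \ge a \log m$. So setting $a = 64 d_{\mathbb{H}}(n) / \varepsilon^2$, (49) is satisfied if

$$m \ge \frac{88}{\varepsilon^2} \left[ 2 d_{\mathbb{H}}(n) \log \frac{22}{\varepsilon} + \frac{1}{n} \log \frac{4}{\delta} \right].$$

□

Corollary 13 shows that any algorithm learning n tasks using the hypothesis space family $\mathbb{H}$
requires no more than

$$m = O\left( \frac{1}{\varepsilon^2} \left[ d_{\mathbb{H}}(n) \log \frac{1}{\varepsilon} + \frac{1}{n} \log \frac{1}{\delta} \right] \right) \tag{50}$$

examples of each task to ensure that with high probability the average true error of any n hypotheses
it selects from $\mathbb{H}^n$ is within $\varepsilon$ of their average empirical error on the sample $\mathbf{z}$. We now give a
theorem showing that if the learning algorithm is required to produce n hypotheses whose average
true error is within $\varepsilon$ of the best possible error (achievable using $\mathbb{H}^n$) for an arbitrary sequence of
distributions $P_1, \dots, P_n$, then within a $\log \frac{1}{\varepsilon}$ factor the number of examples in equation (50) is also
necessary.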
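
As a rough numerical illustration of Corollary 13, the sketch below evaluates the per-task sample size and then checks that it drives the right-hand side of Theorem 12, after Lemma 11 is applied, below $\delta$. The function names and the example values of $d_{\mathbb{H}}(n)$, $n$, $\varepsilon$ and $\delta$ are assumptions made for illustration, and natural logarithms are assumed throughout.

```python
from math import e, log

def examples_per_task(d_n, n, eps, delta):
    """Per-task sample size from Corollary 13 (natural logarithms assumed)."""
    return (88.0 / eps ** 2) * (2 * d_n * log(22.0 / eps)
                                + log(4.0 / delta) / n)

def log_failure_bound(d_n, n, m, eps):
    """log of 4 * (2em/d)^(n d) * exp(-eps^2 n m / 64).

    This is the right-hand side of Theorem 12 with Pi(n, 2m) replaced by
    the bound of Lemma 11; working in log space avoids overflow.
    """
    return log(4.0) + n * d_n * log(2 * e * m / d_n) - eps ** 2 * n * m / 64

# Illustrative values (assumptions, not taken from the paper).
d_n, n, eps, delta = 10, 5, 0.1, 0.05
m = examples_per_task(d_n, n, eps, delta)
print(f"m >= {m:.0f} examples per task")
print("failure probability bound below delta:",
      log_failure_bound(d_n, n, m, eps) <= log(delta))
```

The constants make the resulting numbers very large for realistic $\varepsilon$; as with most uniform convergence bounds, the scaling in $d_{\mathbb{H}}(n)$, $n$, $\varepsilon$ and $\delta$ is the informative part.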

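For any sequence $\mathbf{P} = (P_1, \dots, P_n)$ of n probability distributions on $X \times \{\pm 1\}$, define
$\mathrm{opt}_{\mathbf{P}}(\mathbb{H}^n)$ by

$$\mathrm{opt}_{\mathbf{P}}(\mathbb{H}^n) := \inf_{\mathbf{h} \in \mathbb{H}^n} \mathrm{er}_{\mathbf{P}}(\mathbf{h}).$$

Theorem 14. Let $\mathbb{H}$ be a Boolean hypothesis space family such that $\mathbb{H}^1$ contains at least two
functions. For each $n = 1, 2, \dots$, let $\mathcal{A}_n$ be any learning algorithm taking as input $(n, m)$-samples
$\mathbf{z} \in (X \times \{\pm 1\})^{(n,m)}$ and producing as output n hypotheses $\mathbf{h} = (h_1, \dots, h_n) \in \mathbb{H}^n$. For all
$0 < \varepsilon < 1/64$ and $0 < \delta < 1/64$, if

$$m < \frac{1}{\varepsilon^2} \left[ \frac{d_{\mathbb{H}}(n)}{616} + (1 - \varepsilon^2) \frac{1}{n} \log \frac{1}{8\delta(1 - 2\delta)} \right]$$

then there exist distributions $\mathbf{P} = (P_1, \dots, P_n)$ such that with probability at least $\delta$ (over the
random choice of $\mathbf{z}$),

$$\mathrm{er}_{\mathbf{P}}(\mathcal{A}_n(\mathbf{z})) > \mathrm{opt}_{\mathbf{P}}(\mathbb{H}^n) + \varepsilon.$$

Proof. See Appendix C. □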

3.4.2 Linear Threshold Networks 

Corollary 13 and Theorem 14 show that within constants and a $\log(1/\varepsilon)$ factor, the sample complexity of
learning n tasks using the Boolean hypothesis space family $\mathbb{H}$ is controlled by the complexity
parameter $d_{\mathbb{H}}(n)$. In this section we derive bounds on $d_{\mathbb{H}}(n)$ for hypothesis space families constructed
as thresholded linear combinations of Boolean feature maps. Specifically, we assume $\mathbb{H}$ is of the
form given by (39), (38), (37) and (36), where now the squashing function $\sigma$ is replaced with a hard
threshold:

$$\sigma(x) := \begin{cases} 1 & \text{if } x \ge 0, \\ -1 & \text{otherwise,} \end{cases}$$

and we don't restrict the range of the feature and output layer weights. Note that in this case the
proof of Theorem 8 does not carry through because the constants k, k' in Theorem 7 depend on the
Lipschitz bound on $\sigma$.

Theorem 15. Let $\mathbb{H}$ be a hypothesis space family of the form given in (39), (38), (37) and (36), with
a hard threshold sigmoid function $\sigma$. Recall that the parameters d, l and k are the input dimension,
number of hidden nodes in the feature map and number of features (output nodes in the feature map)
respectively. Let $W := l(d + 1) + k(l + 1)$ (the number of adjustable parameters in the feature
map). Then,

$$d_{\mathbb{H}}(n) \le 2 \left( \frac{W}{n} + k + 1 \right) \log_2 \left( 2e(k + l + 1) \right).$$



Proof. Recall that for each $w \in \mathbb{R}^W$, $\Phi_w : \mathbb{R}^d \to \mathbb{R}^k$ denotes the feature map with parameters $w$.
For each $\mathbf{x} \in X^{(n,m)}$, let $\Phi_{w|\mathbf{x}}$ denote the matrix

$$\Phi_{w|\mathbf{x}} := \begin{bmatrix} \Phi_w(x_{11}) & \cdots & \Phi_w(x_{1m}) \\ \vdots & \ddots & \vdots \\ \Phi_w(x_{n1}) & \cdots & \Phi_w(x_{nm}) \end{bmatrix}.$$

Note that $\mathbb{H}_{|\mathbf{x}}$ is the set of all binary $n \times m$ matrices obtainable by composing thresholded linear
functions with the elements of $\Phi_{w|\mathbf{x}}$, with the restriction that the same function must be applied to
each element in a row (but the functions may differ between rows). With a slight abuse of notation,
define

$$\Pi_{\Phi}(n, m) := \max_{\mathbf{x} \in X^{(n,m)}} \left| \left\{ \Phi_{w|\mathbf{x}} : w \in \mathbb{R}^W \right\} \right|.$$

Fix $\mathbf{x} \in X^{(n,m)}$. By Sauer's Lemma, each node in the first hidden layer of the feature map computes
at most $(emn/(d+1))^{d+1}$ functions on the $nm$ input vectors in $\mathbf{x}$. Thus, there can be at most
$(emn/(d+1))^{l(d+1)}$ distinct functions from the input to the output of the first hidden layer on
the $nm$ points in $\mathbf{x}$. Fixing the first hidden layer parameters, each node in the second layer of the
feature map computes at most $(emn/(l+1))^{l+1}$ functions on the image of $\mathbf{x}$ produced at the output
of the first hidden layer. Thus the second hidden layer computes no more than $(emn/(l+1))^{k(l+1)}$
functions on the output of the first hidden layer on the $nm$ points in $\mathbf{x}$. So, in total,

$$\Pi_{\Phi}(n, m) \le \left( \frac{emn}{d+1} \right)^{l(d+1)} \left( \frac{emn}{l+1} \right)^{k(l+1)}.$$

Now, for each possible matrix $\Phi_{w|\mathbf{x}}$, the number of functions computable on each row of $\Phi_{w|\mathbf{x}}$ by a
thresholded linear combination of the output of the feature map is at most $(em/(k+1))^{k+1}$. Hence,
the number of binary sign assignments obtainable by applying linear threshold functions to all the
rows is at most $(em/(k+1))^{n(k+1)}$. Thus,

$$\Pi_{\mathbb{H}}(n, m) \le \left( \frac{emn}{d+1} \right)^{l(d+1)} \left( \frac{emn}{l+1} \right)^{k(l+1)} \left( \frac{emn}{n(k+1)} \right)^{n(k+1)}.$$

$f(x) := x \log x$ is a convex function, hence for all $a, b, c > 0$,

$$\left( \frac{k + l + 1}{ka + lb + c} \right)^{ka + lb + c} \ge \left( \frac{1}{a} \right)^{ka} \left( \frac{1}{b} \right)^{lb} \left( \frac{1}{c} \right)^{c}.$$

Substituting $a = l + 1$, $b = d + 1$ and $c = n(k + 1)$ shows that

$$\Pi_{\mathbb{H}}(n, m) \le \left( \frac{emn(k + l + 1)}{W + n(k + 1)} \right)^{W + n(k+1)}. \tag{51}$$

Hence, if

$$m > \left( \frac{W}{n} + k + 1 \right) \log_2 \left( \frac{emn(k + l + 1)}{W + n(k + 1)} \right) \tag{52}$$

then $\Pi_{\mathbb{H}}(n, m) < 2^{nm}$ and so by definition $d_{\mathbb{H}}(n) \le m$. For all $a > 1$, observe that $x > a \log_2 x$
if $x = 2a \log_2 2a$. Setting $x = emn(k + l + 1)/(W + n(k + 1))$ and $a = e(k + l + 1)$ shows that
(52) is satisfied if $m = 2(W/n + k + 1) \log_2(2e(k + l + 1))$. □
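
The two algebraic steps at the end of the proof (the convexity step leading to (51), and the choice of $m$ that satisfies (52)) can be sanity-checked numerically. The sketch below works in log space; the network sizes are hypothetical and the function names are assumptions introduced for illustration.

```python
from math import e, log, log2

def log_product_bound(d, l, k, n, m):
    """log of the three-factor bound on Pi_H(n, m) derived in the proof."""
    return (l * (d + 1) * log(e * m * n / (d + 1))
            + k * (l + 1) * log(e * m * n / (l + 1))
            + n * (k + 1) * log(e * m / (k + 1)))

def log_bound_51(d, l, k, n, m):
    """log of the right-hand side of (51)."""
    W = l * (d + 1) + k * (l + 1)
    total = W + n * (k + 1)
    return total * log(e * m * n * (k + l + 1) / total)

d, l, k = 20, 10, 5  # hypothetical network sizes
W = l * (d + 1) + k * (l + 1)
for n in (1, 5, 50):
    m = 2 * (W / n + k + 1) * log2(2 * e * (k + l + 1))
    # Convexity step: the product bound is dominated by (51).
    assert log_product_bound(d, l, k, n, m) <= log_bound_51(d, l, k, n, m)
    # At this m, (51) is below 2^(nm), so no n-by-m matrix is shattered.
    assert log_bound_51(d, l, k, n, m) < n * m * log(2)
    print(n, round(m, 1))
```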

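Theorem 16. Let $\mathbb{H}$ be as in Theorem 15 with the following extra restrictions: $d \ge 3$, $l \ge k$ and
$k \le d$. Then

$$d_{\mathbb{H}}(n) \ge \frac{1}{2} \left( \left\lfloor \frac{W}{2n} \right\rfloor + k + 1 \right).$$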



Proof. We bound $\overline{d}(\mathbb{H})$ and $\underline{d}(\mathbb{H})$ and then apply Lemma 10. In the present setting $\mathbb{H}^1$ contains all
three-layer linear-threshold networks with d input nodes, l hidden nodes in the first hidden layer, k
hidden nodes in the second hidden layer and one output node. From Theorem 13 in Bartlett (1993),
we have

$$\mathrm{VCdim}(\mathbb{H}^1) \ge dl + \frac{l(k - 1)}{2} + 1,$$

which under the restrictions stated above is greater than $W/2$. Hence $\overline{d}(\mathbb{H}) \ge W/2$.

As $k \le d$ and $l \ge k$ we can choose a feature weight assignment so that the feature map is the
identity on k components of the input vector and insensitive to the setting of the remaining $d - k$
components. Hence we can generate $k + 1$ points in $X$ whose image under the feature map is
shattered by the linear threshold output node, and so $\underline{d}(\mathbb{H}) = k + 1$. □
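
To see how the bounds of Theorems 15 and 16 behave as the number of tasks grows, the following minimal sketch tabulates both for a fixed architecture. The network sizes are hypothetical and the function names are assumptions introduced for illustration.

```python
from math import e, log2

def feature_map_params(d, l, k):
    """W = l(d + 1) + k(l + 1), as defined in Theorem 15."""
    return l * (d + 1) + k * (l + 1)

def d_upper(d, l, k, n):
    """Theorem 15: d_H(n) <= 2 (W/n + k + 1) log2(2e(k + l + 1))."""
    W = feature_map_params(d, l, k)
    return 2 * (W / n + k + 1) * log2(2 * e * (k + l + 1))

def d_lower(d, l, k, n):
    """Theorem 16: d_H(n) >= (floor(W / 2n) + k + 1) / 2."""
    assert d >= 3 and l >= k and k <= d, "restrictions of Theorem 16"
    W = feature_map_params(d, l, k)
    return 0.5 * (W // (2 * n) + k + 1)

d, l, k = 100, 50, 10  # hypothetical network sizes
for n in (1, 10, 100, 1000, 10000):
    print(n, d_lower(d, l, k, n), round(d_upper(d, l, k, n), 1))
```

Both columns fall from order $W$ at $n = 1$ towards order $k$ as $n$ grows, which is the sense in which the cost of learning the feature map is shared across tasks.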



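Combining Theorem 15 with Corollary 13 shows that

$$m \ge O\left( \frac{1}{\varepsilon^2} \left[ \left( \frac{W}{n} + k + 1 \right) \log \frac{1}{\varepsilon} + \frac{1}{n} \log \frac{1}{\delta} \right] \right)$$

examples of each task suffice when learning n tasks using a linear threshold hypothesis space family,
while combining Theorem 16 with Theorem 14 shows that if

$$m \le \Omega\left( \frac{1}{\varepsilon^2} \left[ \left( \frac{W}{n} + k + 1 \right) + \frac{1}{n} \log \frac{1}{\delta} \right] \right)$$

then any learning algorithm will fail on some set of n tasks.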



4. Conclusion 

The problem of inductive bias is one that has broad significance in machine learning. In this paper
we have introduced a formal model of inductive bias learning that applies when the learner is able
to sample from multiple related tasks. We proved that provided certain covering numbers computed 
from the set of all hypothesis spaces available to the bias learner are finite, any hypothesis space 
that contains good solutions to sufficiently many training tasks is likely to contain good solutions to 
novel tasks drawn from the same environment. 

In the specific case of learning a set of features, we showed that the number of examples m
required of each task in an n-task training set obeys $m = O(k + W/n)$, where k is the number of
features and W is a measure of the complexity of the feature class. We showed that this bound is
essentially tight for Boolean feature maps constructed from linear threshold networks. In addition,
we proved that the number of tasks required to ensure good performance from the features on novel
tasks is no more than $O(W)$. We also showed how a good set of features may be found by gradient
descent.

The model of this paper represents a first step towards a formal model of hierarchical approaches 
to learning. By modelling a learner's uncertainty concerning its environment in probabilistic terms,
we have shown how learning can occur simultaneously at both the base level (learn the tasks at
hand) and at the meta-level (learn bias that can be transferred to novel tasks). From a technical
perspective, it is the assumption that tasks are distributed probabilistically that allows the
performance guarantees to be proved. From a practical perspective, there are many problem domains that
can be viewed as probabilistically distributed sets of related tasks. For example, speech recognition 
may be decomposed along many different axes: words, speakers, accents, etc. Face recognition 
represents a potentially infinite domain of related tasks. Medical diagnosis and prognosis problems 
using the same pathology tests are yet another example. All of these domains should benefit from 
being tackled with a bias learning approach. 

Natural avenues for further enquiry include: 

• Alternative constructions for $\mathbb{H}$. Although widely applicable, the specific example on feature
learning via gradient descent represents just one possible way of generating and searching
the hypothesis space family $\mathbb{H}$. It would be interesting to investigate alternative methods,
including decision tree approaches, approaches from Inductive Logic Programming (Khan
et al., 1998), and whether more general learning techniques such as boosting can be applied
in a bias learning setting.

• Algorithms for automatically determining the hypothesis space family $\mathbb{H}$. In our model the
structure of $\mathbb{H}$ is fixed a priori and represents the hyper-bias of the bias learner. It would
be interesting to see to what extent this structure can also be learnt.

• Algorithms for automatically determining task relatedness. In ordinary learning there is usually
little doubt whether an individual example belongs to the same learning task or not.
The analogous question in bias learning is whether an individual learning task belongs to a
given set of related tasks, which in contrast to ordinary learning, does not always have such
a clear-cut answer. For most of the examples we have discussed here, such as speech and
face recognition, the task-relatedness is not in question, but in other cases such as medical
problems it is not so clear. Grouping too large a subset of tasks together as related tasks could
clearly have a detrimental impact on bias-learning or multi-task learning, and there is empirical
evidence to support this (Caruana, 1997). Thus, algorithms for automatically determining
task-relatedness are a potentially useful avenue for further research. In this context, see Silver
and Mercer (1996), Thrun and O'Sullivan (1996). Note that the question of task relatedness
is clearly only meaningful relative to a particular hypothesis space family $\mathbb{H}$ (for example, all
possible collections of tasks are related if $\mathbb{H}$ contains every possible hypothesis space).

• Extended hierarchies. For an extension of our two-level approach to arbitrarily deep hierarchies,
see Langford (1999). An interesting further question is to what extent the hierarchy can
be inferred from data. This is somewhat related to the question of automatic induction of
structure in graphical models.

Appendix A. Uniform Convergence Results

Theorem 2 provides a bound (uniform over all $\mathcal{H} \in \mathbb{H}$) on the probability of large deviation between
$\mathrm{er}_Q(\mathcal{H})$ and $\hat{\mathrm{er}}_{\mathbf{z}}(\mathcal{H})$. To obtain a more general result, we follow Haussler (1992) and introduce the
following parameterized class of metrics on $\mathbb{R}^+$:

$$d_\nu[x, y] := \frac{|x - y|}{x + y + \nu},$$

where $\nu > 0$. Our main theorem will be a uniform bound on the probability of large values of
$d_\nu[\mathrm{er}_Q(\mathcal{H}), \hat{\mathrm{er}}_{\mathbf{z}}(\mathcal{H})]$, rather than $|\mathrm{er}_Q(\mathcal{H}) - \hat{\mathrm{er}}_{\mathbf{z}}(\mathcal{H})|$. Theorem 2 will then follow as a corollary, as
will better bounds for the realizable case $\hat{\mathrm{er}}_{\mathbf{z}}(\mathcal{H}) = 0$ (Appendix A.3).

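Lemma 17. The following three properties of $d_\nu$ are easily established:

1. For all $r, s \ge 0$, $0 \le d_\nu[r, s] \le 1$.

2. For all $0 \le r \le s \le t$, $d_\nu[r, s] \le d_\nu[r, t]$ and $d_\nu[s, t] \le d_\nu[r, t]$.

3. For $0 \le r, s \le 1$, $\frac{|r - s|}{\nu + 2} \le d_\nu[r, s] \le \frac{|r - s|}{\nu}$.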
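
The three properties of Lemma 17, together with the triangle inequality for $d_\nu$ that the later proofs rely on, can be spot-checked numerically. The sketch below is only a randomized sanity check under the definition of $d_\nu$ given above; the function name `d_nu`, the sampling ranges and the tolerance are assumptions made for illustration.

```python
import random

def d_nu(x, y, nu):
    """d_nu[x, y] = |x - y| / (x + y + nu) for x, y >= 0 and nu > 0."""
    return abs(x - y) / (x + y + nu)

random.seed(0)
tol = 1e-12  # slack for floating-point rounding
for _ in range(10_000):
    nu = random.uniform(0.01, 5.0)
    r, s, t = sorted(random.uniform(0.0, 1.0) for _ in range(3))
    # Property 1: values lie in [0, 1].
    assert 0.0 <= d_nu(r, s, nu) <= 1.0
    # Property 2: compatibility with the ordering r <= s <= t.
    assert d_nu(r, s, nu) <= d_nu(r, t, nu) + tol
    assert d_nu(s, t, nu) <= d_nu(r, t, nu) + tol
    # Property 3: comparison with |r - s| for arguments in [0, 1].
    assert abs(r - s) / (nu + 2) <= d_nu(r, s, nu) + tol
    assert d_nu(r, s, nu) <= abs(r - s) / nu + tol
    # Triangle inequality (checked here for the ordered triple only).
    assert d_nu(r, t, nu) <= d_nu(r, s, nu) + d_nu(s, t, nu) + tol
print("all checks passed")
```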

For ease of exposition we have up until now been dealing explicitly with hypothesis spaces $\mathcal{H}$
containing functions $h : X \to Y$, and then constructing loss functions $h_l$ mapping $X \times Y \to [0, 1]$
by $h_l(x, y) := l(h(x), y)$ for some loss function $l : Y \times Y \to [0, 1]$. However, in general we can view
$h_l$ just as a function from an abstract set $Z$ ($X \times Y$) to $[0, 1]$ and ignore its particular construction
in terms of the loss function $l$. So for the remainder of this section, unless otherwise stated, all
hypothesis spaces $\mathcal{H}$ will be sets of functions mapping $Z$ to $[0, 1]$. It will also be considerably more
convenient to transpose our notation for $(n, m)$-samples, writing the n training sets as columns
instead of rows:

$$\mathbf{z} = \begin{matrix} z_{11} & \cdots & z_{1n} \\ \vdots & \ddots & \vdots \\ z_{m1} & \cdots & z_{mn} \end{matrix}$$

where each $z_{ij} \in Z$. Recalling the definition of $(X \times Y)^{(n,m)}$ (Equation 9 and prior discussion),
with this transposition $\mathbf{z}$ lives in $(X \times Y)^{(m,n)}$. The following definition now generalizes quantities
like $\hat{\mathrm{er}}_{\mathbf{z}}(\mathcal{H})$, $\mathrm{er}_{\mathbf{P}}(\mathcal{H})$ and so on to this new setting.

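Definition 6. Let $\mathcal{H}_1, \dots, \mathcal{H}_n$ be n sets of functions mapping $Z$ into $[0, 1]$. For any $h_1 \in
\mathcal{H}_1, \dots, h_n \in \mathcal{H}_n$, let $h_1 \oplus \cdots \oplus h_n$ or simply $\mathbf{h}$ denote the map

$$\mathbf{h}(\vec{z}) = \frac{1}{n} \sum_{i=1}^{n} h_i(z_i)$$

for all $\vec{z} = (z_1, \dots, z_n) \in Z^n$. Let $\mathcal{H}_1 \oplus \cdots \oplus \mathcal{H}_n$ denote the set of all such functions. Given
$\mathbf{h} \in \mathcal{H}_1 \oplus \cdots \oplus \mathcal{H}_n$ and m elements of $(X \times Y)^n$, $(\vec{z}_1, \dots, \vec{z}_m)$ (or equivalently an element $\mathbf{z}$ of
$(X \times Y)^{(m,n)}$ by writing the $\vec{z}_i$ as rows), define

$$\hat{\mathrm{er}}_{\mathbf{z}}(\mathbf{h}) := \frac{1}{m} \sum_{i=1}^{m} \mathbf{h}(\vec{z}_i)$$

(recall equation (8)). Similarly, for any product probability measure $\mathbf{P} = P_1 \times \cdots \times P_n$ on
$(X \times Y)^n$, define

$$\mathrm{er}_{\mathbf{P}}(\mathbf{h}) := \int_{Z^n} \mathbf{h}(\vec{z}) \, d\mathbf{P}(\vec{z})$$

(recall equation (26)). For any $\mathbf{h}, \mathbf{h}' : (X \times Y)^n \to [0, 1]$ (not necessarily of the form $h_1 \oplus \cdots \oplus h_n$),
define

$$d_{\mathbf{P}}(\mathbf{h}, \mathbf{h}') := \int_{Z^n} |\mathbf{h}(\vec{z}) - \mathbf{h}'(\vec{z})| \, d\mathbf{P}(\vec{z})$$

(recall equation (17)). For any class of functions $\mathcal{H}$ mapping $(X \times Y)^n$ to $[0, 1]$, define

$$\mathcal{C}(\varepsilon, \mathcal{H}) := \sup_{\mathbf{P}} \mathcal{N}(\varepsilon, \mathcal{H}, d_{\mathbf{P}})$$

where the supremum is over all product probability measures on $(X \times Y)^n$ and $\mathcal{N}(\varepsilon, \mathcal{H}, d_{\mathbf{P}})$ is the
size of the smallest $\varepsilon$-cover of $\mathcal{H}$ under $d_{\mathbf{P}}$ (recall Definition 4).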

The following theorem is the main result from which the rest of the uniform convergence results 
in this paper are derived. 

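Theorem 18. Let $\mathcal{H} \subseteq \mathcal{H}_1 \oplus \cdots \oplus \mathcal{H}_n$ be a permissible class of functions mapping $(X \times Y)^n$ into
$[0, 1]$. Let $\mathbf{z} \in (X \times Y)^{(m,n)}$ be generated by $m \ge 2/(\alpha^2 \nu)$ independent trials from $(X \times Y)^n$
according to some product probability measure $\mathbf{P} = P_1 \times \cdots \times P_n$. For all $\nu > 0$, $0 < \alpha < 1$,

$$\Pr\left\{ \mathbf{z} \in (X \times Y)^{(m,n)} : \sup_{\mathbf{h} \in \mathcal{H}} d_\nu[\hat{\mathrm{er}}_{\mathbf{z}}(\mathbf{h}), \mathrm{er}_{\mathbf{P}}(\mathbf{h})] > \alpha \right\} \le 4 \mathcal{C}(\alpha\nu/8, \mathcal{H}) \exp(-\alpha^2 \nu nm / 8). \tag{53}$$

The following immediate corollary will also be of use later.

Corollary 19. Under the same conditions as Theorem 18, if

$$m \ge \max\left\{ \frac{8}{\alpha^2 \nu n} \log \frac{4 \mathcal{C}(\alpha\nu/8, \mathcal{H})}{\delta}, \frac{2}{\alpha^2 \nu} \right\}, \tag{54}$$

then

$$\Pr\left\{ \mathbf{z} \in (X \times Y)^{(m,n)} : \sup_{\mathbf{h} \in \mathcal{H}} d_\nu[\hat{\mathrm{er}}_{\mathbf{z}}(\mathbf{h}), \mathrm{er}_{\mathbf{P}}(\mathbf{h})] > \alpha \right\} \le \delta. \tag{55}$$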
A.1 Proof of Theorem 18

The proof is via a double symmetrization argument of the kind given in chapter 2 of Pollard (1984).
I have also borrowed some ideas from the proof of Theorem 3 in Haussler (1992).

A.1.1 First Symmetrization

An extra piece of notation: for all $\mathbf{z} \in (X \times Y)^{(2m,n)}$, let $\mathbf{z}(1)$ be the top half of $\mathbf{z}$ and $\mathbf{z}(2)$ be the
bottom half, viz:

$$\mathbf{z}(1) = \begin{matrix} z_{11} & \cdots & z_{1n} \\ \vdots & \ddots & \vdots \\ z_{m1} & \cdots & z_{mn} \end{matrix} \qquad \mathbf{z}(2) = \begin{matrix} z_{m+1,1} & \cdots & z_{m+1,n} \\ \vdots & \ddots & \vdots \\ z_{2m,1} & \cdots & z_{2m,n} \end{matrix}$$

The following lemma is the first "symmetrization trick." We relate the probability of large deviation
between an empirical estimate of the loss and the true loss to the probability of large deviation
between two independent empirical estimates of the loss.

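Lemma 20. Let $\mathcal{H}$ be a permissible set of functions from $(X \times Y)^n$ into $[0, 1]$ and let $P$ be a
probability measure on $(X \times Y)^n$. For all $\nu > 0$, $0 < \alpha < 1$ and $m \ge 2/(\alpha^2 \nu)$,

$$\Pr\left\{ \mathbf{z} \in (X \times Y)^{(m,n)} : \sup_{h \in \mathcal{H}} d_\nu[\hat{\mathrm{er}}_{\mathbf{z}}(h), \mathrm{er}_P(h)] > \alpha \right\}
\le 2 \Pr\left\{ \mathbf{z} \in (X \times Y)^{(2m,n)} : \sup_{h \in \mathcal{H}} d_\nu[\hat{\mathrm{er}}_{\mathbf{z}(1)}(h), \hat{\mathrm{er}}_{\mathbf{z}(2)}(h)] > \frac{\alpha}{2} \right\}. \tag{56}$$

Proof. Note first that permissibility of $\mathcal{H}$ guarantees the measurability of suprema over $\mathcal{H}$
(Lemma 32 part 5). By the triangle inequality for $d_\nu$, if $d_\nu[\hat{\mathrm{er}}_{\mathbf{z}(1)}(h), \mathrm{er}_P(h)] > \alpha$ and
$d_\nu[\hat{\mathrm{er}}_{\mathbf{z}(2)}(h), \mathrm{er}_P(h)] < \alpha/2$, then $d_\nu[\hat{\mathrm{er}}_{\mathbf{z}(1)}(h), \hat{\mathrm{er}}_{\mathbf{z}(2)}(h)] > \alpha/2$. Thus,

$$\Pr\left\{ \mathbf{z} \in (X \times Y)^{(2m,n)} : \exists h \in \mathcal{H} : d_\nu[\hat{\mathrm{er}}_{\mathbf{z}(1)}(h), \hat{\mathrm{er}}_{\mathbf{z}(2)}(h)] > \frac{\alpha}{2} \right\}
\ge \Pr\left\{ \mathbf{z} \in (X \times Y)^{(2m,n)} : \exists h \in \mathcal{H} : d_\nu[\hat{\mathrm{er}}_{\mathbf{z}(1)}(h), \mathrm{er}_P(h)] > \alpha \text{ and } d_\nu[\hat{\mathrm{er}}_{\mathbf{z}(2)}(h), \mathrm{er}_P(h)] < \frac{\alpha}{2} \right\}. \tag{57}$$

By Chebyshev's inequality, for any fixed h,

$$\Pr\left\{ \mathbf{z} \in (X \times Y)^{(m,n)} : d_\nu[\hat{\mathrm{er}}_{\mathbf{z}}(h), \mathrm{er}_P(h)] < \frac{\alpha}{2} \right\}
\ge \Pr\left\{ \mathbf{z} \in (X \times Y)^{(m,n)} : \frac{|\hat{\mathrm{er}}_{\mathbf{z}}(h) - \mathrm{er}_P(h)|}{\nu + \mathrm{er}_P(h)} < \frac{\alpha}{2} \right\}
\ge 1 - \frac{4\, \mathrm{er}_P(h)(1 - \mathrm{er}_P(h))}{m \alpha^2 (\nu + \mathrm{er}_P(h))^2}
\ge \frac{1}{2}$$

as $m \ge 2/(\alpha^2 \nu)$ and $\mathrm{er}_P(h) \le 1$. Substituting this last expression into the right hand side of (57)
gives the result. □

A.1.2 Second Symmetrization

The second symmetrization trick bounds the probability of large deviation between two empirical
estimates of the loss (i.e. the right hand side of (56)) by computing the probability of large
deviation when elements are randomly permuted between the first and second sample. The following
definition introduces the appropriate permutation group for this purpose.

Definition 7. For all integers $m, n \ge 1$, let $\Gamma_{(2m,n)}$ denote the set of all permutations $\sigma$ of the
sequence of pairs of integers $\{(1, 1), \dots, (1, n), \dots, (2m, 1), \dots, (2m, n)\}$ such that for all i, $1 \le
i \le m$, either $\sigma(i, j) = (m + i, j)$ and $\sigma(m + i, j) = (i, j)$ or $\sigma(i, j) = (i, j)$ and $\sigma(m + i, j) =
(m + i, j)$.

For any $\mathbf{z} \in (X \times Y)^{(2m,n)}$ and any $\sigma \in \Gamma_{(2m,n)}$, let $\mathbf{z}_\sigma$ denote the $2m \times n$ matrix whose $(i, j)$
entry is $z_{\sigma(i,j)}$.
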
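Lemma 21. Let $\mathcal{H} \subseteq \mathcal{H}_1 \oplus \cdots \oplus \mathcal{H}_n$ be a permissible set of functions mapping $(X \times Y)^n$ into
$[0, 1]$ (as in the statement of Theorem 18). Fix $\mathbf{z} \in (X \times Y)^{(2m,n)}$ and let $\hat{\mathcal{H}} := \{\mathbf{f}_1, \dots, \mathbf{f}_M\}$ be
an $\alpha\nu/8$-cover for $(\mathcal{H}, d_{\mathbf{z}})$, where $d_{\mathbf{z}}(\mathbf{h}, \mathbf{h}') := \frac{1}{2m} \sum_{i=1}^{2m} |\mathbf{h}(\vec{z}_i) - \mathbf{h}'(\vec{z}_i)|$ where the $\vec{z}_i$ are the rows
of $\mathbf{z}$. Then,

$$\Pr\left\{ \sigma \in \Gamma_{(2m,n)} : \sup_{\mathbf{h} \in \mathcal{H}} d_\nu[\hat{\mathrm{er}}_{\mathbf{z}_\sigma(1)}(\mathbf{h}), \hat{\mathrm{er}}_{\mathbf{z}_\sigma(2)}(\mathbf{h})] > \frac{\alpha}{2} \right\}
\le \sum_{i=1}^{M} \Pr\left\{ \sigma \in \Gamma_{(2m,n)} : d_\nu[\hat{\mathrm{er}}_{\mathbf{z}_\sigma(1)}(\mathbf{f}_i), \hat{\mathrm{er}}_{\mathbf{z}_\sigma(2)}(\mathbf{f}_i)] > \frac{\alpha}{4} \right\}, \tag{58}$$

where each $\sigma \in \Gamma_{(2m,n)}$ is chosen uniformly at random.

Proof. Fix $\sigma \in \Gamma_{(2m,n)}$ and let $\mathbf{h} \in \mathcal{H}$ be such that $d_\nu[\hat{\mathrm{er}}_{\mathbf{z}_\sigma(1)}(\mathbf{h}), \hat{\mathrm{er}}_{\mathbf{z}_\sigma(2)}(\mathbf{h})] > \alpha/2$ (if there is
no such $\mathbf{h}$ for any $\sigma$ we are already done). Choose $\mathbf{f} \in \hat{\mathcal{H}}$ such that $d_{\mathbf{z}}(\mathbf{h}, \mathbf{f}) \le \alpha\nu/8$. Without loss
of generality we can assume $\mathbf{f}$ is of the form $\mathbf{f} = f_1 \oplus \cdots \oplus f_n$. Now,

$$\frac{2}{\nu} d_{\mathbf{z}}(\mathbf{h}, \mathbf{f}) = \frac{\sum_{i=1}^{2m} \left| \sum_{j=1}^{n} \left[ h_j(z_{ij}) - f_j(z_{ij}) \right] \right|}{\nu m n}
\ge d_\nu[\hat{\mathrm{er}}_{\mathbf{z}_\sigma(1)}(\mathbf{h}), \hat{\mathrm{er}}_{\mathbf{z}_\sigma(1)}(\mathbf{f})] + d_\nu[\hat{\mathrm{er}}_{\mathbf{z}_\sigma(2)}(\mathbf{h}), \hat{\mathrm{er}}_{\mathbf{z}_\sigma(2)}(\mathbf{f})].$$

Hence, by the triangle inequality for $d_\nu$,

$$\frac{2}{\nu} d_{\mathbf{z}}(\mathbf{h}, \mathbf{f}) + d_\nu[\hat{\mathrm{er}}_{\mathbf{z}_\sigma(1)}(\mathbf{f}), \hat{\mathrm{er}}_{\mathbf{z}_\sigma(2)}(\mathbf{f})]
\ge d_\nu[\hat{\mathrm{er}}_{\mathbf{z}_\sigma(1)}(\mathbf{h}), \hat{\mathrm{er}}_{\mathbf{z}_\sigma(1)}(\mathbf{f})] + d_\nu[\hat{\mathrm{er}}_{\mathbf{z}_\sigma(2)}(\mathbf{h}), \hat{\mathrm{er}}_{\mathbf{z}_\sigma(2)}(\mathbf{f})] + d_\nu[\hat{\mathrm{er}}_{\mathbf{z}_\sigma(1)}(\mathbf{f}), \hat{\mathrm{er}}_{\mathbf{z}_\sigma(2)}(\mathbf{f})]
\ge d_\nu[\hat{\mathrm{er}}_{\mathbf{z}_\sigma(1)}(\mathbf{h}), \hat{\mathrm{er}}_{\mathbf{z}_\sigma(2)}(\mathbf{h})]. \tag{59}$$

But $\frac{2}{\nu} d_{\mathbf{z}}(\mathbf{h}, \mathbf{f}) \le \alpha/4$ by construction and $d_\nu[\hat{\mathrm{er}}_{\mathbf{z}_\sigma(1)}(\mathbf{h}), \hat{\mathrm{er}}_{\mathbf{z}_\sigma(2)}(\mathbf{h})] > \alpha/2$ by assumption, so
(59) implies $d_\nu[\hat{\mathrm{er}}_{\mathbf{z}_\sigma(1)}(\mathbf{f}), \hat{\mathrm{er}}_{\mathbf{z}_\sigma(2)}(\mathbf{f})] > \alpha/4$. Thus,

$$\left\{ \sigma \in \Gamma_{(2m,n)} : \exists \mathbf{h} \in \mathcal{H} : d_\nu[\hat{\mathrm{er}}_{\mathbf{z}_\sigma(1)}(\mathbf{h}), \hat{\mathrm{er}}_{\mathbf{z}_\sigma(2)}(\mathbf{h})] > \frac{\alpha}{2} \right\}
\subseteq \left\{ \sigma \in \Gamma_{(2m,n)} : \exists \mathbf{f} \in \hat{\mathcal{H}} : d_\nu[\hat{\mathrm{er}}_{\mathbf{z}_\sigma(1)}(\mathbf{f}), \hat{\mathrm{er}}_{\mathbf{z}_\sigma(2)}(\mathbf{f})] > \frac{\alpha}{4} \right\},$$

which gives (58). □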
Now we bound the probability of each term in the right hand side of (58). 

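Lemma 22. Let $\mathbf{f} : (X \times Y)^n \to [0, 1]$ be any function that can be written in the form $\mathbf{f} =
f_1 \oplus \cdots \oplus f_n$. For any $\mathbf{z} \in (X \times Y)^{(2m,n)}$,

$$\Pr\left\{ \sigma \in \Gamma_{(2m,n)} : d_\nu[\hat{\mathrm{er}}_{\mathbf{z}_\sigma(1)}(\mathbf{f}), \hat{\mathrm{er}}_{\mathbf{z}_\sigma(2)}(\mathbf{f})] > \frac{\alpha}{4} \right\} \le 2 \exp\left( \frac{-\alpha^2 \nu m n}{8} \right) \tag{60}$$

where each $\sigma \in \Gamma_{(2m,n)}$ is chosen uniformly at random.

Proof. For any $\sigma \in \Gamma_{(2m,n)}$,

$$d_\nu[\hat{\mathrm{er}}_{\mathbf{z}_\sigma(1)}(\mathbf{f}), \hat{\mathrm{er}}_{\mathbf{z}_\sigma(2)}(\mathbf{f})] = \frac{\left| \sum_{i=1}^{m} \sum_{j=1}^{n} \left[ f_j(z_{\sigma(i,j)}) - f_j(z_{\sigma(m+i,j)}) \right] \right|}{\nu m n + \sum_{i=1}^{2m} \sum_{j=1}^{n} f_j(z_{ij})}. \tag{61}$$

To simplify the notation denote $f_j(z_{ij})$ by $\beta_{ij}$. For each pair $ij$, $1 \le i \le m$, $1 \le j \le n$, let
$Y_{ij}$ be an independent random variable such that $Y_{ij} = \beta_{ij} - \beta_{m+i,j}$ with probability 1/2 and
$Y_{ij} = \beta_{m+i,j} - \beta_{ij}$ with probability 1/2. From (61),

$$\Pr\left\{ \sigma \in \Gamma_{(2m,n)} : d_\nu[\hat{\mathrm{er}}_{\mathbf{z}_\sigma(1)}(\mathbf{f}), \hat{\mathrm{er}}_{\mathbf{z}_\sigma(2)}(\mathbf{f})] > \frac{\alpha}{4} \right\}
= \Pr\left\{ \sigma \in \Gamma_{(2m,n)} : \left| \sum_{i=1}^{m} \sum_{j=1}^{n} \left[ \beta_{\sigma(i,j)} - \beta_{\sigma(m+i,j)} \right] \right| > \frac{\alpha}{4} \left( \nu m n + \sum_{i=1}^{2m} \sum_{j=1}^{n} \beta_{ij} \right) \right\}
= \Pr\left\{ \left| \sum_{i=1}^{m} \sum_{j=1}^{n} Y_{ij} \right| > \frac{\alpha}{4} \left( \nu m n + \sum_{i=1}^{2m} \sum_{j=1}^{n} \beta_{ij} \right) \right\}.$$

For zero-mean independent random variables $Y_1, \dots, Y_N$ with bounded ranges $a_i \le Y_i \le b_i$,
Hoeffding's inequality (Devroye, Györfi, & Lugosi, 1996) is

$$\Pr\left\{ \left| \sum_{i=1}^{N} Y_i \right| \ge \eta \right\} \le 2 \exp\left( -\frac{2\eta^2}{\sum_{i=1}^{N} (b_i - a_i)^2} \right).$$

Noting that the range of each $Y_{ij}$ is $[-|\beta_{ij} - \beta_{i+m,j}|, |\beta_{ij} - \beta_{i+m,j}|]$, we have

$$\Pr\left\{ \left| \sum_{i=1}^{m} \sum_{j=1}^{n} Y_{ij} \right| > \frac{\alpha}{4} \left( \nu m n + \sum_{i=1}^{2m} \sum_{j=1}^{n} \beta_{ij} \right) \right\}
\le 2 \exp\left( -\frac{\alpha^2 \left( \nu m n + \sum_{i=1}^{2m} \sum_{j=1}^{n} \beta_{ij} \right)^2}{32 \sum_{i=1}^{m} \sum_{j=1}^{n} (\beta_{ij} - \beta_{i+m,j})^2} \right).$$

Let $\gamma = \sum_{i=1}^{2m} \sum_{j=1}^{n} \beta_{ij}$. As $0 \le \beta_{ij} \le 1$, $\sum_{i=1}^{m} \sum_{j=1}^{n} (\beta_{ij} - \beta_{m+i,j})^2 \le \gamma$. Hence,

$$2 \exp\left( -\frac{\alpha^2 \left( \nu m n + \sum_{i=1}^{2m} \sum_{j=1}^{n} \beta_{ij} \right)^2}{32 \sum_{i=1}^{m} \sum_{j=1}^{n} (\beta_{ij} - \beta_{i+m,j})^2} \right) \le 2 \exp\left( -\frac{\alpha^2 (\nu m n + \gamma)^2}{32 \gamma} \right).$$

$(\nu m n + \gamma)^2 / \gamma$ is minimized by setting $\gamma = \nu m n$ giving a value of $4 \nu m n$. Hence

$$\Pr\left\{ \sigma \in \Gamma_{(2m,n)} : d_\nu[\hat{\mathrm{er}}_{\mathbf{z}_\sigma(1)}(\mathbf{f}), \hat{\mathrm{er}}_{\mathbf{z}_\sigma(2)}(\mathbf{f})] > \frac{\alpha}{4} \right\} \le 2 \exp\left( \frac{-\alpha^2 \nu m n}{8} \right)$$

as required. □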
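
The bound (60) can be compared against a direct simulation of the random swaps. The sketch below is a Monte Carlo illustration only: the values $\beta_{ij}$ are generated at random (binary with a small mean, which is an assumption chosen so that the bound is not vacuous), and the settings of $m$, $n$, $\nu$ and $\alpha$ are hypothetical.

```python
import math
import random

rng = random.Random(0)
m, n, nu, alpha, p = 50, 4, 0.1, 0.9, 0.05  # hypothetical settings

# beta[i][j] plays the role of f_j(z_ij); here binary with mean p.
beta = [[1.0 if rng.random() < p else 0.0 for _ in range(n)]
        for _ in range(2 * m)]
gamma = sum(map(sum, beta))
diffs = [beta[i][j] - beta[m + i][j] for i in range(m) for j in range(n)]

def swapped_d_nu():
    """d_nu between the two half-sample errors for one random sigma.

    Uses equation (61): each pair (i, j), (m + i, j) is swapped
    independently with probability 1/2, which flips the sign of its
    contribution to the numerator and leaves the denominator unchanged.
    """
    signed = sum(d if rng.random() < 0.5 else -d for d in diffs)
    return abs(signed) / (nu * m * n + gamma)

trials = 5_000
empirical = sum(swapped_d_nu() > alpha / 4 for _ in range(trials)) / trials
bound = 2 * math.exp(-alpha ** 2 * nu * m * n / 8)
print(f"empirical={empirical:.4f}  Lemma 22 bound={bound:.4f}")
```

In runs of this kind the empirical frequency sits well below the bound, which is expected: the proof uses Hoeffding's inequality together with the worst-case choice $\gamma = \nu m n$.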



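A.1.3 Putting it Together

For fixed $\mathbf{z} \in (X \times Y)^{(2m,n)}$, Lemmas 21 and 22 give:

$$\Pr\left\{ \sigma \in \Gamma_{(2m,n)} : \sup_{\mathbf{h} \in \mathcal{H}} d_\nu[\hat{\mathrm{er}}_{\mathbf{z}_\sigma(1)}(\mathbf{h}), \hat{\mathrm{er}}_{\mathbf{z}_\sigma(2)}(\mathbf{h})] > \frac{\alpha}{2} \right\}
\le 2 \mathcal{N}(\alpha\nu/8, \mathcal{H}, d_{\mathbf{z}}) \exp\left( \frac{-\alpha^2 \nu m n}{8} \right).$$

Note that $d_{\mathbf{z}}$ is simply $d_{\mathbf{P}}$ where $\mathbf{P} = (P_1, \dots, P_n)$ and each $P_j$ is the empirical distribution that
puts point mass $1/(2m)$ on each $z_{ij}$, $i = 1, \dots, 2m$ (recall Definition 3). Hence,

$$\Pr\left\{ \sigma \in \Gamma_{(2m,n)}, \mathbf{z} \in (X \times Y)^{(2m,n)} : \sup_{\mathbf{h} \in \mathcal{H}} d_\nu[\hat{\mathrm{er}}_{\mathbf{z}_\sigma(1)}(\mathbf{h}), \hat{\mathrm{er}}_{\mathbf{z}_\sigma(2)}(\mathbf{h})] > \frac{\alpha}{2} \right\}
\le 2 \mathcal{C}(\alpha\nu/8, \mathcal{H}) \exp\left( \frac{-\alpha^2 \nu m n}{8} \right).$$

Now, for a random choice of $\mathbf{z}$, each $z_{ij}$ in $\mathbf{z}$ is independently (but not identically) distributed and $\sigma$
only ever swaps $z_{ij}$ and $z_{i+m,j}$ (so that $\sigma$ swaps a $z_{ij}$ drawn according to $P_j$ with another component
drawn according to the same distribution). Thus we can integrate out with respect to the choice of
$\sigma$ and write

$$\Pr\left\{ \mathbf{z} \in (X \times Y)^{(2m,n)} : \sup_{\mathbf{h} \in \mathcal{H}} d_\nu[\hat{\mathrm{er}}_{\mathbf{z}(1)}(\mathbf{h}), \hat{\mathrm{er}}_{\mathbf{z}(2)}(\mathbf{h})] > \frac{\alpha}{2} \right\}
\le 2 \mathcal{C}(\alpha\nu/8, \mathcal{H}) \exp\left( \frac{-\alpha^2 \nu m n}{8} \right).$$

Applying Lemma 20 to this expression gives Theorem 18. □

A.2 Proof of Theorem 2

Another piece of notation is required for the proof. For any hypothesis space $\mathcal{H}$ and any probability
measures $\mathbf{P} = (P_1, \dots, P_n)$ on $Z$, let

$$\hat{\mathrm{er}}_{\mathbf{P}}(\mathcal{H}) := \frac{1}{n} \sum_{i=1}^{n} \inf_{h \in \mathcal{H}} \mathrm{er}_{P_i}(h).$$

Note that we have used $\hat{\mathrm{er}}_{\mathbf{P}}(\mathcal{H})$ rather than $\mathrm{er}_{\mathbf{P}}(\mathcal{H})$ to indicate that $\hat{\mathrm{er}}_{\mathbf{P}}(\mathcal{H})$ is another empirical
estimate of $\mathrm{er}_Q(\mathcal{H})$.

With the $(n, m)$-sampling process, in addition to the sample $\mathbf{z}$ there is also generated a
sequence of probability measures, $\mathbf{P} = (P_1, \dots, P_n)$, although these are not supplied to the learner.
This notion is used in the following Lemma, where $\Pr\{(\mathbf{z}, \mathbf{P}) \in (X \times Y)^{(n,m)} \times \mathcal{P}^n : A\}$ means
"the probability of generating a sequence of measures $\mathbf{P}$ from the environment $(\mathcal{P}, Q)$ and then an
$(n, m)$-sample $\mathbf{z}$ according to $\mathbf{P}$ such that $A$ holds".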

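Lemma 23. If

$$\Pr\left\{ (\mathbf{z}, \mathbf{P}) \in (X \times Y)^{(n,m)} \times \mathcal{P}^n : \sup_{\mathcal{H} \in \mathbb{H}} d_\nu[\hat{\mathrm{er}}_{\mathbf{z}}(\mathcal{H}), \hat{\mathrm{er}}_{\mathbf{P}}(\mathcal{H})] > \frac{\alpha}{2} \right\} \le \frac{\delta}{2}, \tag{62}$$

and

$$\Pr\left\{ \mathbf{P} \in \mathcal{P}^n : \sup_{\mathcal{H} \in \mathbb{H}} d_\nu[\hat{\mathrm{er}}_{\mathbf{P}}(\mathcal{H}), \mathrm{er}_Q(\mathcal{H})] > \frac{\alpha}{2} \right\} \le \frac{\delta}{2}, \tag{63}$$

then

$$\Pr\left\{ \mathbf{z} \in (X \times Y)^{(n,m)} : \sup_{\mathcal{H} \in \mathbb{H}} d_\nu[\hat{\mathrm{er}}_{\mathbf{z}}(\mathcal{H}), \mathrm{er}_Q(\mathcal{H})] > \alpha \right\} \le \delta.$$

Proof. Follows directly from the triangle inequality for $d_\nu$. □

We treat the two inequalities in Lemma 23 separately.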

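A.2.1 Inequality (62)

In the following Lemma we replace the supremum over $\mathcal{H} \in \mathbb{H}$ in inequality (62) with a supremum
over $\mathbf{h} \in \mathbb{H}^n_l$.

Lemma 24.

$$\Pr\left\{ (\mathbf{z}, \mathbf{P}) \in (X \times Y)^{(n,m)} \times \mathcal{P}^n : \sup_{\mathcal{H} \in \mathbb{H}} d_\nu[\hat{\mathrm{er}}_{\mathbf{z}}(\mathcal{H}), \hat{\mathrm{er}}_{\mathbf{P}}(\mathcal{H})] > \alpha \right\}
\le \Pr\left\{ (\mathbf{z}, \mathbf{P}) \in (X \times Y)^{(n,m)} \times \mathcal{P}^n : \sup_{\mathbf{h} \in \mathbb{H}^n_l} d_\nu[\hat{\mathrm{er}}_{\mathbf{z}}(\mathbf{h}), \mathrm{er}_{\mathbf{P}}(\mathbf{h})] > \alpha \right\}. \tag{64}$$

Proof. Suppose that $(\mathbf{z}, \mathbf{P})$ are such that $\sup_{\mathcal{H} \in \mathbb{H}} d_\nu[\hat{\mathrm{er}}_{\mathbf{z}}(\mathcal{H}), \hat{\mathrm{er}}_{\mathbf{P}}(\mathcal{H})] > \alpha$. Let $\mathcal{H}$ satisfy this
inequality. Suppose first that $\hat{\mathrm{er}}_{\mathbf{z}}(\mathcal{H}) \le \hat{\mathrm{er}}_{\mathbf{P}}(\mathcal{H})$. By the definition of $\hat{\mathrm{er}}_{\mathbf{z}}(\mathcal{H})$, for all $\varepsilon > 0$ there
exists $\mathbf{h} \in \mathcal{H}^n := \mathcal{H} \oplus \cdots \oplus \mathcal{H}$ such that $\hat{\mathrm{er}}_{\mathbf{z}}(\mathbf{h}) < \hat{\mathrm{er}}_{\mathbf{z}}(\mathcal{H}) + \varepsilon$. Hence by property (3) of the $d_\nu$
metric, for all $\varepsilon > 0$, there exists $\mathbf{h} \in \mathcal{H}^n$ such that $d_\nu[\hat{\mathrm{er}}_{\mathbf{z}}(\mathbf{h}), \hat{\mathrm{er}}_{\mathbf{z}}(\mathcal{H})] < \varepsilon$. Pick an arbitrary $\mathbf{h}$
satisfying this inequality. By definition, $\hat{\mathrm{er}}_{\mathbf{P}}(\mathcal{H}) \le \mathrm{er}_{\mathbf{P}}(\mathbf{h})$, and so $\hat{\mathrm{er}}_{\mathbf{z}}(\mathcal{H}) \le \hat{\mathrm{er}}_{\mathbf{P}}(\mathcal{H}) \le \mathrm{er}_{\mathbf{P}}(\mathbf{h})$.
As $d_\nu[\hat{\mathrm{er}}_{\mathbf{z}}(\mathcal{H}), \hat{\mathrm{er}}_{\mathbf{P}}(\mathcal{H})] > \alpha$ (by assumption), by the compatibility of $d_\nu$ with the ordering on the
reals, $d_\nu[\hat{\mathrm{er}}_{\mathbf{z}}(\mathcal{H}), \mathrm{er}_{\mathbf{P}}(\mathbf{h})] > \alpha$; say it equals $\alpha + \delta$ for some $\delta > 0$. By the triangle inequality for $d_\nu$,

$$d_\nu[\hat{\mathrm{er}}_{\mathbf{z}}(\mathbf{h}), \mathrm{er}_{\mathbf{P}}(\mathbf{h})] + d_\nu[\hat{\mathrm{er}}_{\mathbf{z}}(\mathbf{h}), \hat{\mathrm{er}}_{\mathbf{z}}(\mathcal{H})] \ge d_\nu[\hat{\mathrm{er}}_{\mathbf{z}}(\mathcal{H}), \mathrm{er}_{\mathbf{P}}(\mathbf{h})] = \alpha + \delta.$$

Thus $d_\nu[\hat{\mathrm{er}}_{\mathbf{z}}(\mathbf{h}), \mathrm{er}_{\mathbf{P}}(\mathbf{h})] > \alpha + \delta - \varepsilon$ and for any $\varepsilon > 0$ an $\mathbf{h}$ satisfying this inequality can be
found. Choosing $\varepsilon = \delta$ shows that there exists $\mathbf{h} \in \mathcal{H}^n$ such that $d_\nu[\hat{\mathrm{er}}_{\mathbf{z}}(\mathbf{h}), \mathrm{er}_{\mathbf{P}}(\mathbf{h})] > \alpha$.

If instead, $\hat{\mathrm{er}}_{\mathbf{P}}(\mathcal{H}) < \hat{\mathrm{er}}_{\mathbf{z}}(\mathcal{H})$, then an identical argument can be run with the role of $\mathbf{z}$ and $\mathbf{P}$
interchanged. Thus in both cases,

$$\sup_{\mathcal{H} \in \mathbb{H}} d_\nu[\hat{\mathrm{er}}_{\mathbf{z}}(\mathcal{H}), \hat{\mathrm{er}}_{\mathbf{P}}(\mathcal{H})] > \alpha \;\Rightarrow\; \exists \mathbf{h} \in \mathbb{H}^n_l : d_\nu[\hat{\mathrm{er}}_{\mathbf{z}}(\mathbf{h}), \mathrm{er}_{\mathbf{P}}(\mathbf{h})] > \alpha,$$

which completes the proof of the Lemma. □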
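By the nature of the $(n, m)$ sampling process,

$$\Pr\left\{ (\mathbf{z}, \mathbf{P}) \in (X \times Y)^{(n,m)} \times \mathcal{P}^n : \sup_{\mathbf{h} \in \mathbb{H}^n_l} d_\nu[\hat{\mathrm{er}}_{\mathbf{z}}(\mathbf{h}), \mathrm{er}_{\mathbf{P}}(\mathbf{h})] > \alpha \right\}
= \int_{\mathbf{P} \in \mathcal{P}^n} \Pr\left\{ \mathbf{z} \in (X \times Y)^{(n,m)} : \sup_{\mathbf{h} \in \mathbb{H}^n_l} d_\nu[\hat{\mathrm{er}}_{\mathbf{z}}(\mathbf{h}), \mathrm{er}_{\mathbf{P}}(\mathbf{h})] > \alpha \right\} dQ^n(\mathbf{P}). \tag{65}$$

Now $\mathbb{H}^n_l \subseteq \mathbb{H}^1_l \oplus \cdots \oplus \mathbb{H}^1_l$ where $\mathbb{H}^1_l := \{h_l : h \in \mathcal{H}, \mathcal{H} \in \mathbb{H}\}$ and $\mathbb{H}^n_l$ is permissible by the assumed
permissibility of $\mathbb{H}$ (Lemma 32, Appendix D). Hence $\mathbb{H}^n_l$ satisfies the conditions of Corollary 19
and so combining Lemma 24, Equation (65) and substituting $\alpha/2$ for $\alpha$ and $\delta/2$ for $\delta$ in Corollary
19 gives the following Lemma on the sample size required to ensure (62) holds.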



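Lemma 25. If

$$m \ge \max\left\{ \frac{32}{\alpha^2 \nu n} \log \frac{8 \mathcal{C}(\alpha\nu/16, \mathbb{H}^n_l)}{\delta}, \frac{8}{\alpha^2 \nu} \right\}$$

then

$$\Pr\left\{ (\mathbf{z}, \mathbf{P}) \in (X \times Y)^{(n,m)} \times \mathcal{P}^n : \sup_{\mathcal{H} \in \mathbb{H}} d_\nu[\hat{\mathrm{er}}_{\mathbf{z}}(\mathcal{H}), \hat{\mathrm{er}}_{\mathbf{P}}(\mathcal{H})] > \frac{\alpha}{2} \right\} \le \frac{\delta}{2}.$$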
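A.2.2 Inequality (63)

Note that $\mathrm{er}_{\mathbf{P}}(\mathcal{H}) = \frac{1}{n}\sum_{i=1}^n \mathcal{H}^*(P_i)$ and $\mathrm{er}_Q(\mathcal{H}) = \mathbb{E}_{P \sim Q}\,\mathcal{H}^*(P)$, i.e. the expectation of $\mathcal{H}^*(P)$ where $P$ is distributed according to $Q$. So to bound the left-hand side of (63) we can apply Corollary 19 with $n = 1$, $m$ replaced by $n$, $\mathcal{H}$ replaced by $\mathbb{H}^*$, $\alpha$ and $\delta$ replaced by $\alpha/2$ and $\delta/2$ respectively, $P$ replaced by $Q$ and $Z$ replaced by $\mathcal{P}$. Note that $\mathbb{H}^*$ is permissible whenever $\mathbb{H}$ is (Lemma 32). Thus, if

\[
n \ge \max\left\{ \frac{32}{\alpha^2 \nu} \log \frac{8\,\mathcal{C}(\alpha\nu/16, \mathbb{H}^*)}{\delta},\ \frac{8}{\alpha^2 \nu} \right\} \tag{66}
\]

then inequality (63) is satisfied.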

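Now, putting together Lemma 23, Lemma 25 and Equation (66), we have proved the following more general version of Theorem 2.

Theorem 26. Let $\mathbb{H}$ be a permissible hypothesis space family and let $\mathbf{z}$ be an $(n,m)$-sample generated from the environment $(\mathcal{P}, Q)$. For all $0 < \alpha, \delta < 1$ and $\nu > 0$, if

\[
n \ge \max\left\{ \frac{32}{\alpha^2 \nu} \log \frac{8\,\mathcal{C}(\alpha\nu/16, \mathbb{H}^*)}{\delta},\ \frac{8}{\alpha^2 \nu} \right\}
\quad\text{and}\quad
m \ge \max\left\{ \frac{32}{\alpha^2 \nu n} \log \frac{8\,\mathcal{C}(\alpha\nu/16, \mathbb{H}^n_l)}{\delta},\ \frac{8}{\alpha^2 \nu} \right\},
\]

then

\[
\Pr\Bigl\{\mathbf{z} \in (X \times Y)^{(n,m)} : \sup_{\mathcal{H} \in \mathbb{H}} d_\nu\bigl[\mathrm{er}_{\mathbf{z}}(\mathcal{H}), \mathrm{er}_Q(\mathcal{H})\bigr] > \alpha\Bigr\} \le \delta.
\]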



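To get Theorem 2, observe that $\mathrm{er}_Q(\mathcal{H}) > \mathrm{er}_{\mathbf{z}}(\mathcal{H}) + \varepsilon \implies d_\nu[\mathrm{er}_{\mathbf{z}}(\mathcal{H}), \mathrm{er}_Q(\mathcal{H})] > \varepsilon/(2+\nu)$. Setting $\alpha = \varepsilon/(2+\nu)$ and maximizing $\alpha^2\nu$ gives $\nu = 2$. Substituting $\alpha = \varepsilon/4$ and $\nu = 2$ into Theorem 26 gives Theorem 2.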
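As a sanity check on this substitution, the sketch below evaluates the constants of Theorem 26 at $\alpha = \varepsilon/(2+\nu)$ and searches a grid for the $\nu$ that maximizes $\alpha^2\nu$. The function names and the grid are illustrative assumptions; the capacities themselves are not computed here.

```python
def alpha_sq_nu(eps, nu):
    """alpha^2 * nu with alpha = eps / (2 + nu)."""
    alpha = eps / (2.0 + nu)
    return alpha * alpha * nu


def theorem26_constants(eps, nu):
    """Return (32/(alpha^2 nu), 8/(alpha^2 nu), alpha*nu/16) with alpha = eps/(2+nu).

    The first value multiplies the log-capacity term, the second is the
    capacity-free lower bound, the third is the scale at which the
    capacities are evaluated.
    """
    alpha = eps / (2.0 + nu)
    a2n = alpha * alpha * nu
    return 32.0 / a2n, 8.0 / a2n, alpha * nu / 16.0


def best_nu(eps, lo=0.01, hi=10.0, steps=999):
    """Grid search for the nu in [lo, hi] that maximizes alpha^2 * nu."""
    grid = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    return max(grid, key=lambda nu: alpha_sq_nu(eps, nu))


if __name__ == "__main__":
    eps = 0.1
    print(best_nu(eps))
    print(theorem26_constants(eps, 2.0))
```

At $\nu = 2$ the three values are $256/\varepsilon^2$, $64/\varepsilon^2$ and $\varepsilon/32$, which follows directly from $\alpha = \varepsilon/4$.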

A.3 The Realizable Case 

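In Theorem 2 the sample complexity for both $m$ and $n$ scales as $1/\varepsilon^2$. This can be improved to $1/\varepsilon$ if instead of requiring $\mathrm{er}_Q(\mathcal{H}) \le \mathrm{er}_{\mathbf{z}}(\mathcal{H}) + \varepsilon$, we require only that $\mathrm{er}_Q(\mathcal{H}) \le \kappa\,\mathrm{er}_{\mathbf{z}}(\mathcal{H}) + \varepsilon$ for some $\kappa > 1$. To see this, observe that

\[
\mathrm{er}_Q(\mathcal{H}) > \frac{1+\alpha}{1-\alpha}\,\mathrm{er}_{\mathbf{z}}(\mathcal{H}) + \frac{\alpha\nu}{1-\alpha} \implies d_\nu\bigl[\mathrm{er}_{\mathbf{z}}(\mathcal{H}), \mathrm{er}_Q(\mathcal{H})\bigr] > \alpha,
\]

so setting $\alpha\nu/(1-\alpha) = \varepsilon$ in Theorem 26 and treating $\alpha$ as a constant gives:

Corollary 27. Under the same conditions as Theorem 26, for all $\varepsilon > 0$ and $0 < \alpha, \delta < 1$, if

\[
n \ge \max\left\{ \frac{32}{\alpha(1-\alpha)\varepsilon} \log \frac{8\,\mathcal{C}((1-\alpha)\varepsilon/16, \mathbb{H}^*)}{\delta},\ \frac{8}{\alpha(1-\alpha)\varepsilon} \right\}
\]

and

\[
m \ge \max\left\{ \frac{32}{\alpha(1-\alpha)\varepsilon n} \log \frac{8\,\mathcal{C}((1-\alpha)\varepsilon/16, \mathbb{H}^n_l)}{\delta},\ \frac{8}{\alpha(1-\alpha)\varepsilon} \right\},
\]

then

\[
\Pr\Bigl\{\mathbf{z} \in (X \times Y)^{(n,m)} : \sup_{\mathcal{H} \in \mathbb{H}} \mathrm{er}_Q(\mathcal{H}) > \frac{1+\alpha}{1-\alpha}\,\mathrm{er}_{\mathbf{z}}(\mathcal{H}) + \varepsilon\Bigr\} \le \delta.
\]

These bounds are particularly useful if we know that $\mathrm{er}_{\mathbf{z}}(\mathcal{H}) = 0$, for then we can set $\alpha = 1/2$ (which maximizes $\alpha(1-\alpha)$).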
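To make the $1/\varepsilon$ scaling concrete, the following sketch computes the constants appearing in Corollary 27 for a given $\alpha$. The function name and return layout are hypothetical; the formulas are the ones stated in the corollary, with $\nu = (1-\alpha)\varepsilon/\alpha$.

```python
def corollary27_constants(eps, alpha):
    """Return (kappa, lead, floor, cover_scale) for Corollary 27.

    kappa       = (1 + alpha) / (1 - alpha), the factor on the empirical error
    lead        = 32 / (alpha (1 - alpha) eps), the factor on the log term
    floor       = 8 / (alpha (1 - alpha) eps), the capacity-free lower bound
    cover_scale = (1 - alpha) eps / 16, the scale of the capacities
    """
    if not (0.0 < alpha < 1.0) or eps <= 0.0:
        raise ValueError("need 0 < alpha < 1 and eps > 0")
    prod = alpha * (1.0 - alpha) * eps
    kappa = (1.0 + alpha) / (1.0 - alpha)
    return kappa, 32.0 / prod, 8.0 / prod, (1.0 - alpha) * eps / 16.0


if __name__ == "__main__":
    for eps in (0.1, 0.01):
        print(eps, corollary27_constants(eps, 0.5))
```

With $\alpha = 1/2$ this gives $\kappa = 3$, a leading factor of $128/\varepsilon$ and a capacity scale of $\varepsilon/32$, so the dependence on $\varepsilon$ outside the capacity term is linear rather than quadratic.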

Appendix B. Proof of Theorem 6 

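Recalling Definition 6, for $\mathbb{H}$ of the form given in (32), $\mathbb{H}^n_l$ can be written

\[
\mathbb{H}^n_l = \{ g_1 \circ f \oplus \cdots \oplus g_n \circ f : g_1, \ldots, g_n \in \mathcal{G}_l \text{ and } f \in \mathcal{F} \}.
\]

To write $\mathbb{H}^n_l$ as a composition of two function classes, note that if for each $f : X \to V$ we define $\bar{f} : (X \times Y)^n \to (V \times Y)^n$ by

\[
\bar{f}(x_1, y_1, \ldots, x_n, y_n) := (f(x_1), y_1, \ldots, f(x_n), y_n)
\]

then $g_1 \circ f \oplus \cdots \oplus g_n \circ f = g_1 \oplus \cdots \oplus g_n \circ \bar{f}$. Thus, setting $\mathcal{G}^n_l := \mathcal{G}_l \oplus \cdots \oplus \mathcal{G}_l$ and $\overline{\mathcal{F}} := \{\bar{f} : f \in \mathcal{F}\}$,

\[
\mathbb{H}^n_l = \mathcal{G}^n_l \circ \overline{\mathcal{F}}. \tag{67}
\]

The following two Lemmas will enable us to bound $\mathcal{C}(\varepsilon, \mathbb{H}^n_l)$.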

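Lemma 28. Let $\mathcal{H} : X \times Y \to [0,1]$ be of the form $\mathcal{H} = \mathcal{G}_l \circ \overline{\mathcal{F}}$ where $X \times Y \xrightarrow{\overline{\mathcal{F}}} V \times Y \xrightarrow{\mathcal{G}_l} [0,1]$. For all $\varepsilon_1, \varepsilon_2 > 0$,

\[
\mathcal{C}(\varepsilon_1 + \varepsilon_2, \mathcal{H}) \le \mathcal{C}_{\mathcal{G}_l}(\varepsilon_1, \overline{\mathcal{F}})\, \mathcal{C}(\varepsilon_2, \mathcal{G}_l).
\]

Proof. Fix a measure $P$ on $X \times Y$ and let $F$ be a minimum size $\varepsilon_1$-cover for $(\overline{\mathcal{F}}, d_{[P,\mathcal{G}_l]})$. By definition $|F| \le \mathcal{C}_{\mathcal{G}_l}(\varepsilon_1, \overline{\mathcal{F}})$. For each $f \in F$ let $P_f$ be the measure on $V \times Y$ defined by $P_f(S) = P(f^{-1}(S))$ for any set $S$ in the σ-algebra on $V \times Y$ ($f$ is measurable so $f^{-1}(S)$ is measurable). Let $G_f$ be a minimum size $\varepsilon_2$-cover for $(\mathcal{G}_l, d_{P_f})$. By definition again, $|G_f| \le \mathcal{C}(\varepsilon_2, \mathcal{G}_l)$. Let $N := \{g \circ f : f \in F \text{ and } g \in G_f\}$. Note that $|N| \le \mathcal{C}_{\mathcal{G}_l}(\varepsilon_1, \overline{\mathcal{F}})\,\mathcal{C}(\varepsilon_2, \mathcal{G}_l)$, so the Lemma will be proved if $N$ can be shown to be an $(\varepsilon_1 + \varepsilon_2)$-cover for $(\mathcal{H}, d_P)$. So, given any $g \circ f \in \mathcal{H}$, choose $f' \in F$ such that $d_{[P,\mathcal{G}_l]}(f, f') \le \varepsilon_1$ and $g' \in G_{f'}$ such that $d_{P_{f'}}(g, g') \le \varepsilon_2$. Now,

\[
\begin{aligned}
d_P(g \circ f, g' \circ f') &\le d_P(g \circ f, g \circ f') + d_P(g \circ f', g' \circ f') \\
&\le d_{[P,\mathcal{G}_l]}(f, f') + d_{P_{f'}}(g, g') \\
&\le \varepsilon_1 + \varepsilon_2,
\end{aligned}
\]

where the first line follows from the triangle inequality for $d_P$ and the second line follows from the facts $d_P(g \circ f', g' \circ f') = d_{P_{f'}}(g, g')$ and $d_P(g \circ f, g \circ f') \le d_{[P,\mathcal{G}_l]}(f, f')$. Thus $N$ is an $(\varepsilon_1 + \varepsilon_2)$-cover for $(\mathcal{H}, d_P)$ and so the result follows. □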

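Recalling the definition of $\mathcal{H}_1 \oplus \cdots \oplus \mathcal{H}_n$ (Definition 6), we have the following Lemma.

Lemma 29.

\[
\mathcal{C}(\varepsilon, \mathcal{H}_1 \oplus \cdots \oplus \mathcal{H}_n) \le \prod_{i=1}^n \mathcal{C}(\varepsilon, \mathcal{H}_i)
\]

Proof. Fix a product probability measure $\mathbf{P} = P_1 \times \cdots \times P_n$ on $(X \times Y)^n$. Let $N_1, \ldots, N_n$ be $\varepsilon$-covers of $(\mathcal{H}_1, d_{P_1}), \ldots, (\mathcal{H}_n, d_{P_n})$, and let $N = N_1 \oplus \cdots \oplus N_n$. Given $h = h_1 \oplus \cdots \oplus h_n \in \mathcal{H}_1 \oplus \cdots \oplus \mathcal{H}_n$, choose $g_1 \oplus \cdots \oplus g_n \in N$ such that $d_{P_i}(h_i, g_i) \le \varepsilon$ for each $i = 1, \ldots, n$. Now,

\[
\begin{aligned}
d_{\mathbf{P}}(h_1 \oplus \cdots \oplus h_n, g_1 \oplus \cdots \oplus g_n) &= \frac{1}{n} \int_{Z^n} \left| \sum_{i=1}^n h_i(z_i) - \sum_{i=1}^n g_i(z_i) \right| d\mathbf{P}(z_1, \ldots, z_n) \\
&\le \frac{1}{n} \sum_{i=1}^n d_{P_i}(h_i, g_i) \\
&\le \varepsilon.
\end{aligned}
\]

Thus $N$ is an $\varepsilon$-cover for $\mathcal{H}_1 \oplus \cdots \oplus \mathcal{H}_n$, and as $|N| = \prod_{i=1}^n |N_i|$ the result follows. □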
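The product-cover construction in this proof can be illustrated on finite domains. The sketch below builds greedy $\varepsilon$-covers of randomly generated function classes, forms their direct sum, and measures the worst distance from any element of the direct-sum class to the product cover. It assumes, as the $1/n$ factor in the proof indicates, that the direct sum averages its components; all names, the random classes, and the greedy cover routine are illustrative and not taken from the paper.

```python
import itertools
import random


def l1_dist(h, g, p):
    """d_P(h, g) = sum_z p[z] * |h[z] - g[z]| on a finite domain."""
    return sum(pz * abs(hz - gz) for pz, hz, gz in zip(p, h, g))


def greedy_cover(funcs, p, eps):
    """Greedy eps-cover: each function is within eps of some cover element."""
    cover = []
    for h in funcs:
        if all(l1_dist(h, c, p) > eps for c in cover):
            cover.append(h)
    return cover


def nearest(h, cover, p):
    """Closest cover element to h under d_P."""
    return min(cover, key=lambda c: l1_dist(h, c, p))


def direct_sum_dist(hs, gs, ps):
    """Distance between direct sums under the product measure P_1 x ... x P_n,
    where a direct sum evaluates to (1/n) * sum_i h_i(z_i)."""
    n = len(hs)
    total = 0.0
    for zs in itertools.product(*(range(len(p)) for p in ps)):
        weight, diff = 1.0, 0.0
        for i, z in enumerate(zs):
            weight *= ps[i][z]
            diff += hs[i][z] - gs[i][z]
        total += weight * abs(diff) / n
    return total


def check_product_cover(n=2, domain=4, n_funcs=6, eps=0.2, seed=0):
    """Return (worst distance to the product cover, list of component cover sizes)."""
    rng = random.Random(seed)
    ps, classes, covers = [], [], []
    for _ in range(n):
        w = [rng.random() + 0.1 for _ in range(domain)]
        p = [x / sum(w) for x in w]
        funcs = [[rng.random() for _ in range(domain)] for _ in range(n_funcs)]
        ps.append(p)
        classes.append(funcs)
        covers.append(greedy_cover(funcs, p, eps))
    worst = 0.0
    for hs in itertools.product(*classes):
        gs = [nearest(h, covers[i], ps[i]) for i, h in enumerate(hs)]
        worst = max(worst, direct_sum_dist(hs, gs, ps))
    return worst, [len(c) for c in covers]


if __name__ == "__main__":
    print(check_product_cover())
```

The worst-case distance stays at or below $\varepsilon$, and the product cover has exactly as many elements as the product of the component cover sizes, which is the counting step used in the Lemma.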
B.1 Bounding $\mathcal{C}(\varepsilon, \mathbb{H}^n_l)$

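From Lemma 28,

\[
\mathcal{C}(\varepsilon_1 + \varepsilon_2, \mathcal{G}^n_l \circ \overline{\mathcal{F}}) \le \mathcal{C}_{\mathcal{G}^n_l}(\varepsilon_1, \overline{\mathcal{F}})\, \mathcal{C}(\varepsilon_2, \mathcal{G}^n_l) \tag{68}
\]

and from Lemma 29,

\[
\mathcal{C}(\varepsilon_2, \mathcal{G}^n_l) \le \mathcal{C}(\varepsilon_2, \mathcal{G}_l)^n. \tag{69}
\]

Using similar techniques to those used to prove Lemmas 28 and 29, $\mathcal{C}_{\mathcal{G}^n_l}(\varepsilon_1, \overline{\mathcal{F}})$ can be shown to satisfy

\[
\mathcal{C}_{\mathcal{G}^n_l}(\varepsilon_1, \overline{\mathcal{F}}) \le \mathcal{C}_{\mathcal{G}_l}(\varepsilon_1, \mathcal{F}). \tag{70}
\]

Equations (67), (68), (69) and (70) together imply inequality (34).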



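B.2 Bounding $\mathcal{C}(\varepsilon, \mathbb{H}^*)$

We wish to prove that $\mathcal{C}(\varepsilon, \mathbb{H}^*) \le \mathcal{C}_{\mathcal{G}_l}(\varepsilon, \mathcal{F})$ when $\mathbb{H}$ is a hypothesis space family of the form $\mathbb{H} = \{\mathcal{G}_l \circ f : f \in \mathcal{F}\}$. Note that each $\mathcal{H}^* \in \mathbb{H}^*$ corresponds to some $\mathcal{G}_l \circ f$, and that

\[
\mathcal{H}^*(P) = \inf_{g \in \mathcal{G}_l} \mathrm{er}_P(g \circ f).
\]

Any probability measure $Q$ on $\mathcal{P}$ induces a probability measure $Q_{X \times Y}$ on $X \times Y$, defined by

\[
Q_{X \times Y}(S) = \int_{\mathcal{P}} P(S)\, dQ(P)
\]

for any $S$ in the σ-algebra on $X \times Y$. Note also that if $h, h'$ are bounded, positive functions on an arbitrary set $A$, then

\[
\left| \inf_{a \in A} h(a) - \inf_{a \in A} h'(a) \right| \le \sup_{a \in A} \left| h(a) - h'(a) \right|. \tag{71}
\]

Let $Q$ be any probability measure on the space $\mathcal{P}$ of probability measures on $X \times Y$. Let $\mathcal{H}^*_1, \mathcal{H}^*_2$ be two elements of $\mathbb{H}^*$ with corresponding hypothesis spaces $\mathcal{G}_l \circ f_1$, $\mathcal{G}_l \circ f_2$. Then,

\[
\begin{aligned}
d_Q(\mathcal{H}^*_1, \mathcal{H}^*_2) &= \int_{\mathcal{P}} \left| \inf_{g \in \mathcal{G}_l} \mathrm{er}_P(g \circ f_1) - \inf_{g \in \mathcal{G}_l} \mathrm{er}_P(g \circ f_2) \right| dQ(P) \\
&\le \int_{\mathcal{P}} \sup_{g \in \mathcal{G}_l} \left| \mathrm{er}_P(g \circ f_1) - \mathrm{er}_P(g \circ f_2) \right| dQ(P) \quad \text{(by (71) above)} \\
&\le \int_{\mathcal{P}} \int_{X \times Y} \sup_{g \in \mathcal{G}_l} \left| g \circ f_1(x,y) - g \circ f_2(x,y) \right| dP(x,y)\, dQ(P) \\
&= d_{[Q_{X \times Y}, \mathcal{G}_l]}(f_1, f_2).
\end{aligned}
\]

The measurability of $\sup_{\mathcal{G}_l} g \circ f$ is guaranteed by the permissibility of $\mathbb{H}$ (Lemma 32 part 4, Appendix D). From $d_Q(\mathcal{H}^*_1, \mathcal{H}^*_2) \le d_{[Q_{X \times Y}, \mathcal{G}_l]}(f_1, f_2)$ we have

\[
\mathcal{N}(\varepsilon, \mathbb{H}^*, d_Q) \le \mathcal{N}(\varepsilon, \mathcal{F}, d_{[Q_{X \times Y}, \mathcal{G}_l]}), \tag{72}
\]

which gives inequality (35).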
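Inequality (71), which drives the first step of the chain above, is easy to test on finite sets. The following sketch draws random pairs of bounded positive functions (represented as lists of values over a common finite set $A$) and checks the inequality; the function names and random setup are illustrative assumptions.

```python
import random


def inf_gap_and_sup_diff(h, h_prime):
    """Return (|inf h - inf h'|, sup |h - h'|) for two equal-length value lists."""
    gap = abs(min(h) - min(h_prime))
    sup_diff = max(abs(a - b) for a, b in zip(h, h_prime))
    return gap, sup_diff


def check_inequality_71(trials=1000, size=8, seed=0):
    """Check |inf h - inf h'| <= sup |h - h'| on random functions into [0, 1]."""
    rng = random.Random(seed)
    for _ in range(trials):
        h = [rng.random() for _ in range(size)]
        h_prime = [rng.random() for _ in range(size)]
        gap, sup_diff = inf_gap_and_sup_diff(h, h_prime)
        if gap > sup_diff + 1e-12:
            return False
    return True


if __name__ == "__main__":
    print(check_inequality_71())
    print(inf_gap_and_sup_diff([0.2, 0.9], [0.8, 0.3]))
```

In the second call the two infima are attained at different points of $A$, which is the case where the inequality is not an immediate pointwise statement.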
B.3 Proof of Theorem 7

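In order to prove the bounds in Theorem 7 we have to apply Theorem 6 to the neural network hypothesis space family of equation (39). In this case the structure is

\[
\mathbb{R}^d \xrightarrow{\ \mathcal{F}\ } \mathbb{R}^k \xrightarrow{\ \mathcal{G}\ } [0,1]
\]

where $\mathcal{G} = \bigl\{(x_1, \ldots, x_k) \mapsto \sigma\bigl(\sum_{i=1}^k \alpha_i x_i + \alpha_0\bigr) : (\alpha_0, \alpha_1, \ldots, \alpha_k) \in U\bigr\}$ for some bounded subset $U$ of $\mathbb{R}^{k+1}$ and some Lipschitz squashing function $\sigma$. The feature class $\mathcal{F} : \mathbb{R}^d \to \mathbb{R}^k$ is the set of all one hidden layer neural networks with $d$ inputs, $l$ hidden nodes, $k$ outputs, $\sigma$ as the squashing function and weights $w \in T$ where $T$ is a bounded subset of $\mathbb{R}^W$. The Lipschitz restriction on $\sigma$ and the bounded restrictions on the weights ensure that $\mathcal{F}$ and $\mathcal{G}$ are Lipschitz classes. Hence there exists $b < \infty$ such that for all $f \in \mathcal{F}$ and $x, x' \in \mathbb{R}^d$, $\|f(x) - f(x')\| \le b\|x - x'\|$, and for all $g \in \mathcal{G}$ and $x, x' \in \mathbb{R}^k$, $|g(x) - g(x')| \le b\|x - x'\|$, where $\|\cdot\|$ is the $L^1$ norm in each case. The loss function is squared loss.

Now, $g_l(x,y) = l(g(x), y) = (g(x) - y)^2$, hence for all $g, g' \in \mathcal{G}$ and all probability measures $P$ on $\mathbb{R}^k \times [0,1]$ (recall that we assumed the output space $Y$ was $[0,1]$),

\[
\begin{aligned}
d_P(g_l, g'_l) &= \int_{\mathbb{R}^k \times [0,1]} \left| (g(v) - y)^2 - (g'(v) - y)^2 \right| dP(v,y) \\
&\le 2 \int_{\mathbb{R}^k} \left| g(v) - g'(v) \right| dP_{\mathbb{R}^k}(v),
\end{aligned} \tag{73}
\]

where $P_{\mathbb{R}^k}$ is the marginal distribution on $\mathbb{R}^k$ derived from $P$. Similarly, for all $f, f' \in \mathcal{F}$ and probability measures $P$ on $\mathbb{R}^d \times [0,1]$,

\[
d_{[P,\mathcal{G}_l]}(f, f') \le 2b \int_{\mathbb{R}^d} \|f(x) - f'(x)\|\, dP_{\mathbb{R}^d}(x). \tag{74}
\]

Define

\[
\mathcal{C}(\varepsilon, \mathcal{G}, L^1) := \sup_P \mathcal{N}(\varepsilon, \mathcal{G}, L^1(P)),
\]

where the supremum is over all probability measures on (the Borel subsets of) $\mathbb{R}^k$, and $\mathcal{N}(\varepsilon, \mathcal{G}, L^1(P))$ is the size of the smallest $\varepsilon$-cover of $\mathcal{G}$ under the $L^1(P)$ metric. Similarly set

\[
\mathcal{C}(\varepsilon, \mathcal{F}, L^1) := \sup_P \mathcal{N}(\varepsilon, \mathcal{F}, L^1(P)),
\]

where now the supremum is over all probability measures on $\mathbb{R}^d$. Equations (73) and (74) imply

\[
\mathcal{C}(\varepsilon, \mathcal{G}_l) \le \mathcal{C}\left(\frac{\varepsilon}{2}, \mathcal{G}, L^1\right) \tag{75}
\]

\[
\mathcal{C}_{\mathcal{G}_l}(\varepsilon, \mathcal{F}) \le \mathcal{C}\left(\frac{\varepsilon}{2b}, \mathcal{F}, L^1\right) \tag{76}
\]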
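Applying Theorem 11 from Haussler (1992), we find

\[
\mathcal{C}(\varepsilon, \mathcal{G}, L^1) \le \left[\frac{2eb}{\varepsilon}\right]^{2k+2}, \qquad
\mathcal{C}(\varepsilon, \mathcal{F}, L^1) \le \left[\frac{2eb}{\varepsilon}\right]^{2W}.
\]

Substituting these two expressions into (75) and (76) and applying Theorem 6 yields Theorem 7. □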
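The factor 2 in (73) comes from the pointwise bound $|(a-y)^2 - (b-y)^2| = |a-b|\,|a+b-2y| \le 2|a-b|$ for $a, b, y \in [0,1]$. A minimal numerical check of that pointwise bound, with illustrative function names, is:

```python
def squared_loss_gap(a, b, y):
    """|(a - y)^2 - (b - y)^2| for predictions a, b and target y."""
    return abs((a - y) ** 2 - (b - y) ** 2)


def check_lipschitz_in_prediction(steps=25, tol=1e-12):
    """Check |(a-y)^2 - (b-y)^2| <= 2|a-b| on a grid of a, b, y in [0, 1]."""
    grid = [i / steps for i in range(steps + 1)]
    return all(
        squared_loss_gap(a, b, y) <= 2.0 * abs(a - b) + tol
        for a in grid
        for b in grid
        for y in grid
    )


if __name__ == "__main__":
    print(check_lipschitz_in_prediction())
    print(squared_loss_gap(1.0, 0.0, 0.0))
```

Integrating the pointwise bound against $P$ gives (73); the same argument combined with the Lipschitz constant $b$ of $\mathcal{G}$ gives the factor $2b$ in (74).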
Appendix C. Proof of Theorem 14



This proof follows a similar argument to the one presented in Anthony and Bartlett (1999) for 
ordinary Boolean function learning. 
First we need a technical Lemma. 

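Lemma 30. Let $\alpha$ be a random variable uniformly distributed on $\{1/2 + \beta/2,\ 1/2 - \beta/2\}$, with $0 < \beta < 1$. Let $\xi_1, \ldots, \xi_m$ be i.i.d. $\{1, -1\}$-valued random variables with $\Pr(\xi_i = 1) = \alpha$ for all $i$. For any function $f$ mapping $\{1, -1\}^m \to \{1/2 + \beta/2,\ 1/2 - \beta/2\}$,

\[
\Pr\{\xi_1, \ldots, \xi_m : f(\xi_1, \ldots, \xi_m) \ne \alpha\} > \frac{1}{4}\left[1 - \sqrt{1 - e^{-\frac{m\beta^2}{1-\beta^2}}}\right].
\]

Proof. Let $N(\xi)$ denote the number of occurrences of $+1$ in the random sequence $\xi = (\xi_1, \ldots, \xi_m)$. The function $f$ can be viewed as a decision rule, i.e. based on the observations $\xi$, $f$ tries to guess whether the probability of $+1$ is $1/2 + \beta/2$ or $1/2 - \beta/2$. The optimal decision rule is the Bayes estimator: $f(\xi_1, \ldots, \xi_m) = 1/2 + \beta/2$ if $N(\xi) \ge m/2$, and $f(\xi_1, \ldots, \xi_m) = 1/2 - \beta/2$ otherwise. Hence,

\[
\begin{aligned}
\Pr(f(\xi) \ne \alpha) &\ge \frac{1}{2} \Pr\left(N(\xi) \ge \frac{m}{2} \,\middle|\, \alpha = \frac{1}{2} - \frac{\beta}{2}\right) + \frac{1}{2} \Pr\left(N(\xi) < \frac{m}{2} \,\middle|\, \alpha = \frac{1}{2} + \frac{\beta}{2}\right) \\
&> \frac{1}{2} \Pr\left(N(\xi) \ge \frac{m}{2} \,\middle|\, \alpha = \frac{1}{2} - \frac{\beta}{2}\right),
\end{aligned}
\]

which is half the probability that a binomial $(m, 1/2 - \beta/2)$ random variable is at least $m/2$. By Slud's inequality (Slud, 1977),

\[
\Pr(f(\xi) \ne \alpha) > \frac{1}{2} \Pr\left(Z \ge \sqrt{\frac{m\beta^2}{1-\beta^2}}\right)
\]

where $Z$ is normal $(0,1)$. Tate's inequality (Tate, 1953) states that for all $x \ge 0$,

\[
\Pr(Z \ge x) \ge \frac{1}{2}\left[1 - \sqrt{1 - e^{-x^2}}\right].
\]

Combining the last two inequalities completes the proof. □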
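For small $m$ the error of the Bayes rule in this proof can be computed exactly from binomial probabilities and compared with the lower bound of Lemma 30. The sketch below does this; it assumes the bound has the form $\frac{1}{4}\bigl[1 - \sqrt{1 - e^{-m\beta^2/(1-\beta^2)}}\bigr]$, which is what Slud's and Tate's inequalities combine to, and the function names are hypothetical.

```python
import math


def binom_pmf(m, p, j):
    """Pr(Bin(m, p) = j)."""
    return math.comb(m, j) * p**j * (1.0 - p) ** (m - j)


def bayes_error(m, beta):
    """Exact error of the rule 'guess 1/2 + beta/2 iff N >= m/2'
    when alpha is uniform on {1/2 - beta/2, 1/2 + beta/2}."""
    k = math.ceil(m / 2)  # N >= m/2 is the same event as N >= ceil(m/2)
    p_low, p_high = 0.5 - beta / 2, 0.5 + beta / 2
    # alpha is low but the rule guesses high
    err_low = sum(binom_pmf(m, p_low, j) for j in range(k, m + 1))
    # alpha is high but the rule guesses low
    err_high = sum(binom_pmf(m, p_high, j) for j in range(0, k))
    return 0.5 * err_low + 0.5 * err_high


def lemma30_bound(m, beta):
    """(1/4) * (1 - sqrt(1 - exp(-m beta^2 / (1 - beta^2))))."""
    x_sq = m * beta**2 / (1.0 - beta**2)
    return 0.25 * (1.0 - math.sqrt(1.0 - math.exp(-x_sq)))


if __name__ == "__main__":
    for m in (1, 5, 20):
        for beta in (0.1, 0.5, 0.9):
            print(m, beta, bayes_error(m, beta), lemma30_bound(m, beta))
```

Since the Bayes rule minimizes the error probability, any other decision rule $f$ has error at least `bayes_error(m, beta)`, so the comparison covers the worst case for the Lemma.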



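Let $\mathbf{x} \in X^{(n,m)}$ be shattered by $\mathbb{H}$, with $m = d_{\mathbb{H}}(n)$. For each row $i$ in $\mathbf{x}$ let $\mathcal{P}_i$ be the set of all $2^{d_{\mathbb{H}}(n)}$ distributions $P$ on $X \times \{\pm 1\}$ such that $P(x, 1) = P(x, -1) = 0$ if $x$ is not contained in the $i$th row of $\mathbf{x}$, and for each $j = 1, \ldots, d_{\mathbb{H}}(n)$, $P(x_{ij}, 1) = (1 \pm \beta)/(2 d_{\mathbb{H}}(n))$ and $P(x_{ij}, -1) = (1 \mp \beta)/(2 d_{\mathbb{H}}(n))$. Let $\boldsymbol{\mathcal{P}} := \mathcal{P}_1 \times \cdots \times \mathcal{P}_n$.

Note that for $\mathbf{P} = (P_1, \ldots, P_n) \in \boldsymbol{\mathcal{P}}$, the optimal error $\mathrm{opt}_{\mathbf{P}}(\mathbb{H}^n)$ is achieved by any sequence $\mathbf{h}^* = (h^*_1, \ldots, h^*_n)$ such that $h^*_i(x_{ij}) = 1$ if and only if $P_i(x_{ij}, 1) = (1 + \beta)/(2 d_{\mathbb{H}}(n))$, and $\mathbb{H}^n$ always contains such a sequence because $\mathbb{H}$ shatters $\mathbf{x}$. The optimal error is then

\[
\mathrm{opt}_{\mathbf{P}}(\mathbb{H}^n) = \mathrm{er}_{\mathbf{P}}(\mathbf{h}^*) = \frac{1}{n} \sum_{i=1}^n P_i\{h^*_i(x) \ne y\} = \frac{1}{n} \sum_{i=1}^n \sum_{j=1}^{d_{\mathbb{H}}(n)} \frac{1 - \beta}{2 d_{\mathbb{H}}(n)} = \frac{1 - \beta}{2},
\]

and for any $\mathbf{h} = (h_1, \ldots, h_n) \in \mathbb{H}^n$,

\[
\mathrm{er}_{\mathbf{P}}(\mathbf{h}) = \mathrm{opt}_{\mathbf{P}}(\mathbb{H}^n) + \frac{\beta}{n\, d_{\mathbb{H}}(n)} \bigl|\{(i,j) : h_i(x_{ij}) \ne h^*_i(x_{ij})\}\bigr|. \tag{77}
\]

For any $(n,m)$-sample $\mathbf{z}$, let each element $m_{ij}$ in the array

\[
\mathbf{m}(\mathbf{z}) := \begin{bmatrix} m_{11} & \cdots & m_{1 d_{\mathbb{H}}(n)} \\ \vdots & \ddots & \vdots \\ m_{n1} & \cdots & m_{n d_{\mathbb{H}}(n)} \end{bmatrix}
\]

equal the number of occurrences of $x_{ij}$ in $\mathbf{z}$.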

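Now, if we select $\mathbf{P} = (P_1, \ldots, P_n)$ uniformly at random from $\boldsymbol{\mathcal{P}}$, and generate an $(n,m)$-sample $\mathbf{z}$ using $\mathbf{P}$, then for $\mathbf{h} = \mathcal{A}_n(\mathbf{z})$ (the output of the learning algorithm) we have:

\[
\begin{aligned}
\mathbb{E}\bigl(\bigl|\{(i,j) : h_i(x_{ij}) \ne h^*_i(x_{ij})\}\bigr|\bigr) &= \sum_{\mathbf{m}} P(\mathbf{m})\, \mathbb{E}\bigl(\bigl|\{(i,j) : h_i(x_{ij}) \ne h^*_i(x_{ij})\}\bigr| \,\big|\, \mathbf{m}\bigr) \\
&= \sum_{\mathbf{m}} P(\mathbf{m}) \sum_{i=1}^n \sum_{j=1}^{d_{\mathbb{H}}(n)} P\bigl(h_i(x_{ij}) \ne h^*_i(x_{ij}) \,\big|\, m_{ij}\bigr)
\end{aligned}
\]

where $P(\mathbf{m})$ is the probability of generating a configuration $\mathbf{m}$ of the $x_{ij}$ under the $(n,m)$-sampling process and the sum is over all possible configurations. From Lemma 30,

\[
P\bigl(h_i(x_{ij}) \ne h^*_i(x_{ij}) \,\big|\, m_{ij}\bigr) > \frac{1}{4}\left[1 - \sqrt{1 - e^{-\frac{m_{ij}\beta^2}{1-\beta^2}}}\right],
\]

hence

\[
\begin{aligned}
\mathbb{E}\left(\frac{1}{n\, d_{\mathbb{H}}(n)} \bigl|\{(i,j) : h_i(x_{ij}) \ne h^*_i(x_{ij})\}\bigr|\right) &> \frac{1}{n\, d_{\mathbb{H}}(n)} \sum_{\mathbf{m}} P(\mathbf{m}) \sum_{i=1}^n \sum_{j=1}^{d_{\mathbb{H}}(n)} \frac{1}{4}\left[1 - \sqrt{1 - e^{-\frac{m_{ij}\beta^2}{1-\beta^2}}}\right] \\
&\ge \frac{1}{4}\left[1 - \sqrt{1 - e^{-\frac{m\beta^2}{d_{\mathbb{H}}(n)(1-\beta^2)}}}\right]
\end{aligned} \tag{78}
\]

by Jensen's inequality. Since for any $[0,1]$-valued random variable $Z$, $\Pr(Z > x) \ge \mathbb{E}Z - x$, (78) implies:

\[
\Pr\left(\frac{1}{n\, d_{\mathbb{H}}(n)} \bigl|\{(i,j) : h_i(x_{ij}) \ne h^*_i(x_{ij})\}\bigr| > \gamma\alpha\right) > (1 - \gamma)\alpha
\]

where

\[
\alpha := \frac{1}{4}\left[1 - \sqrt{1 - e^{-\frac{m\beta^2}{d_{\mathbb{H}}(n)(1-\beta^2)}}}\right] \tag{79}
\]

and $\gamma \in [0,1]$. Plugging this into (77) shows that

\[
\Pr\bigl\{(\mathbf{P}, \mathbf{z}) : \mathrm{er}_{\mathbf{P}}(\mathcal{A}_n(\mathbf{z})) > \mathrm{opt}_{\mathbf{P}}(\mathbb{H}^n) + \gamma\alpha\beta\bigr\} > (1 - \gamma)\alpha.
\]

Since the inequality holds over the random choice of $\mathbf{P}$, it must also hold for some specific choice of $\mathbf{P}$. Hence for any learning algorithm $\mathcal{A}_n$ there is some sequence of distributions $\mathbf{P}$ such that

\[
\Pr\bigl\{\mathbf{z} : \mathrm{er}_{\mathbf{P}}(\mathcal{A}_n(\mathbf{z})) > \mathrm{opt}_{\mathbf{P}}(\mathbb{H}^n) + \gamma\alpha\beta\bigr\} > (1 - \gamma)\alpha.
\]

Setting

\[
(1 - \gamma)\alpha \ge \delta, \quad\text{and}\quad \gamma\alpha\beta \ge \varepsilon, \tag{80}
\]

ensures

\[
\Pr\bigl\{\mathbf{z} : \mathrm{er}_{\mathbf{P}}(\mathcal{A}_n(\mathbf{z})) > \mathrm{opt}_{\mathbf{P}}(\mathbb{H}^n) + \varepsilon\bigr\} > \delta. \tag{81}
\]

Assuming equality in (80), we get

\[
\alpha = \frac{\delta}{1 - \gamma}, \qquad \beta = \frac{\varepsilon}{\delta} \cdot \frac{1 - \gamma}{\gamma}.
\]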

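Solving (79) for $m$, and substituting the above expressions for $\alpha$ and $\beta$, shows that (81) is satisfied provided

\[
m \le d_{\mathbb{H}}(n) \left[\left(\frac{\gamma}{1-\gamma}\right)^2 \frac{\delta^2}{\varepsilon^2} - 1\right] \log \frac{(1-\gamma)^2}{8\delta(1 - \gamma - 2\delta)}. \tag{82}
\]

Setting $\gamma = 1 - a\delta$ for some $a > 4$ ($a > 4$ since $\alpha < 1/4$ and $\alpha = \delta/(1-\gamma)$), and assuming $\varepsilon, \delta \le 1/(ka)$ for some $k > 2$, (82) becomes

\[
m \le \frac{d_{\mathbb{H}}(n)\left(1 - \frac{2}{k}\right)}{a^2 \varepsilon^2} \log \frac{a^2}{8(a-2)}. \tag{83}
\]

Subject to the constraint $a > 4$, the right hand side of (83) is approximately maximized at $a = 8.7966$, at which point the value exceeds $d_{\mathbb{H}}(n)(1 - 2/k)/(220\varepsilon^2)$. Thus, for all $k \ge 1$, if $\varepsilon, \delta \le 1/(9k)$ and

\[
m \le \frac{d_{\mathbb{H}}(n)\left(1 - \frac{2}{k}\right)}{220\,\varepsilon^2} \tag{84}
\]

then

\[
\Pr\bigl\{\mathbf{z} : \mathrm{er}_{\mathbf{P}}(\mathcal{A}_n(\mathbf{z})) > \mathrm{opt}_{\mathbf{P}}(\mathbb{H}^n) + \varepsilon\bigr\} > \delta.
\]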
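The numerical constants quoted in this step can be reproduced with a short search. The sketch below assumes that, after substituting $\gamma = 1 - a\delta$, the parameter $a$ enters the bound on $m$ only through the factor $\log\bigl(a^2/(8(a-2))\bigr)/a^2$; the function names and the grid search are illustrative choices.

```python
import math


def rhs_factor(a):
    """log(a^2 / (8 (a - 2))) / a^2, the a-dependent factor of the bound on m."""
    return math.log(a * a / (8.0 * (a - 2.0))) / (a * a)


def maximise(lo=4.0, hi=20.0, step=1e-3):
    """Grid search for the maximizer of rhs_factor on (lo, hi]."""
    n = int(round((hi - lo) / step))
    grid = [lo + step * i for i in range(1, n + 1)]
    best = max(grid, key=rhs_factor)
    return best, rhs_factor(best)


if __name__ == "__main__":
    a_star, value = maximise()
    print(a_star, value, 1.0 / value)
```

The maximizer lands near $a = 8.7966$ and the reciprocal of the maximum is just under 220, which is where the constant in (84) comes from.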

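To obtain the $\delta$-dependence in Theorem 14, observe that by assumption $\mathbb{H}^1$ contains at least two functions $h_1, h_2$, hence there exists an $x \in X$ such that $h_1(x) \ne h_2(x)$. Let $P^{\pm}$ be two distributions concentrated on $(x, 1)$ and $(x, -1)$ such that $P^{\pm}(x, h_1(x)) = (1 \pm \varepsilon)/2$ and $P^{\pm}(x, h_2(x)) = (1 \mp \varepsilon)/2$. Let $\mathbf{P}^+ := P^+ \times \cdots \times P^+$ and $\mathbf{P}^- := P^- \times \cdots \times P^-$ be the product distributions on $(X \times \{\pm 1\})^n$ generated by $P^{\pm}$, and $\mathbf{h}_1 := (h_1, \ldots, h_1)$, $\mathbf{h}_2 := (h_2, \ldots, h_2)$. Note that $\mathbf{h}_1$ and $\mathbf{h}_2$ are both in $\mathbb{H}^n$. If $\mathbf{P}$ is one of $\mathbf{P}^{\pm}$ and the learning algorithm $\mathcal{A}_n$ chooses the wrong hypothesis $\mathbf{h}$, then

\[
\mathrm{er}_{\mathbf{P}}(\mathbf{h}) - \mathrm{opt}_{\mathbf{P}}(\mathbb{H}^n) = \varepsilon.
\]

Now, if we choose $\mathbf{P}$ uniformly at random from $\{\mathbf{P}^+, \mathbf{P}^-\}$ and generate an $(n,m)$-sample $\mathbf{z}$ according to $\mathbf{P}$, Lemma 30 shows that

\[
\Pr\bigl\{(\mathbf{P}, \mathbf{z}) : \mathrm{er}_{\mathbf{P}}(\mathcal{A}_n(\mathbf{z})) > \mathrm{opt}_{\mathbf{P}}(\mathbb{H}^n) + \varepsilon\bigr\} > \frac{1}{4}\left[1 - \sqrt{1 - e^{-\frac{nm\varepsilon^2}{1-\varepsilon^2}}}\right]
\]

which is at least $\delta$ if

\[
m \le \frac{1 - \varepsilon^2}{n\varepsilon^2} \log \frac{1}{8\delta(1 - 2\delta)} \tag{85}
\]

provided $0 < \delta < 1/4$. Combining the two constraints on $m$: (84) (with $k = 7$) and (85), and using $\max\{x_1, x_2\} \ge \frac{1}{2}(x_1 + x_2)$ finishes the proof. □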

Appendix D. Measurability 

In order for Theorems 2 and 18 to hold in full generality we had to impose a constraint called "permissibility" on the hypothesis space family $\mathbb{H}$. Permissibility was introduced by Pollard (1984) for ordinary hypothesis classes $\mathcal{H}$. His definition is very similar to Dudley's "image admissible Suslin" (Dudley, 1984). We will be extending this definition to cover hypothesis space families.

Throughout this section we assume all functions $h$ map from (the complete separable metric space) $Z$ into $[0,1]$. Let $\mathcal{B}(T)$ denote the Borel σ-algebra of any topological space $T$. As in Section 2.2, we view $\mathcal{P}$, the set of all probability measures on $Z$, as a topological space by equipping it with the topology of weak convergence. $\mathcal{B}(\mathcal{P})$ is then the σ-algebra generated by this topology. The following two definitions are taken (with minor modifications) from Pollard (1984).

Definition 8. A set $\mathcal{H}$ of $[0,1]$-valued functions on $Z$ is indexed by the set $T$ if there exists a function $f : Z \times T \to [0,1]$ such that

\[
\mathcal{H} = \{ f(\cdot, t) : t \in T \}.
\]

Definition 9. The set $\mathcal{H}$ is permissible if it can be indexed by a set $T$ such that

1. $T$ is an analytic subset of a Polish$^7$ space $\overline{T}$, and

2. the function $f : Z \times T \to [0,1]$ indexing $\mathcal{H}$ by $T$ is measurable with respect to the product σ-algebra $\mathcal{B}(Z) \otimes \mathcal{B}(T)$.

An analytic subset $T$ of a Polish space $\overline{T}$ is simply the continuous image of a Borel subset $X$ of another Polish space $\overline{X}$. The analytic subsets of a Polish space include the Borel sets. They are important because projections of analytic sets are analytic, and can be measured in a complete measure space, whereas projections of Borel sets are not necessarily Borel, and hence cannot be measured with a Borel measure. For more details see Dudley (1989), section 13.2.

Lemma 31. $\mathcal{H}_1 \oplus \cdots \oplus \mathcal{H}_n : (X \times Y)^n \to [0,1]$ is permissible if $\mathcal{H}_1, \ldots, \mathcal{H}_n$ are all permissible.

Proof. Omitted. □

We now define permissibility of hypothesis space families.

7. A topological space is called Polish if it is metrizable such that it is a complete separable metric space.

Definition 10. A hypothesis space family $\mathbb{H} = \{\mathcal{H}\}$ is permissible if there exist sets $S$ and $T$ that are analytic subsets of Polish spaces $\overline{S}$ and $\overline{T}$ respectively, and a function $f : Z \times T \times S \to [0,1]$, measurable with respect to $\mathcal{B}(Z) \otimes \mathcal{B}(T) \otimes \mathcal{B}(S)$, such that

\[
\mathbb{H} = \bigl\{ \{ f(\cdot, t, s) : t \in T \} : s \in S \bigr\}.
\]

Let $(X, \Sigma, \mu)$ be a measure space and $T$ be an analytic subset of a Polish space. Let $\mathcal{A}(X)$ denote the analytic subsets of $X$. The following three facts about analytic sets are taken from Pollard (1984), appendix C.

(a) If $(X, \Sigma, \mu)$ is complete then $\mathcal{A}(X) \subseteq \Sigma$.

(b) $\mathcal{A}(X \times T)$ contains the product σ-algebra $\Sigma \otimes \mathcal{B}(T)$.

(c) For any set $Y$ in $\mathcal{A}(X \times T)$, the projection $\pi_X Y$ of $Y$ onto $X$ is in $\mathcal{A}(X)$.

Recall Definition 2 for the definition of $\mathbb{H}^*$. In the following Lemma we assume that $(Z, \mathcal{B}(Z))$ has been completed with respect to any probability measure $P$, and also that $(\mathcal{P}, \mathcal{B}(\mathcal{P}))$ is complete with respect to the environmental measure $Q$.

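Lemma 32. For any permissible hypothesis space family $\mathbb{H}$,

1. $\mathbb{H}^n_l$ is permissible.

2. $\{h \in \mathcal{H} : \mathcal{H} \in \mathbb{H}\}$ is permissible.

3. $\mathcal{H}$ is permissible for all $\mathcal{H} \in \mathbb{H}$.

4. $\sup_{\mathcal{H}}$ and $\inf_{\mathcal{H}}$ are measurable for all $\mathcal{H} \in \mathbb{H}$.

5. $\mathcal{H}^*$ is measurable for all $\mathcal{H} \in \mathbb{H}$.

6. $\mathbb{H}^*$ is permissible.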

Proof. As we have absorbed the loss function into the hypotheses $h$, $\mathbb{H}^n_l$ is simply the set of all $n$-fold products $\mathcal{H} \oplus \cdots \oplus \mathcal{H}$ such that $\mathcal{H} \in \mathbb{H}$. Thus (1) follows from Lemma 31. (2) and (3) are immediate from the definitions. As $\mathcal{H}$ is permissible for all $\mathcal{H} \in \mathbb{H}$, (4) can be proved by an identical argument to that used in the "Measurable Suprema" section of Pollard (1984), appendix C.

For (5), note that for any Borel-measurable $h : Z \to [0,1]$, the function $\bar{h} : \mathcal{P} \to [0,1]$ defined by $\bar{h}(P) := \int_Z h(z)\, dP(z)$ is Borel measurable (Kechris, 1995, chapter 17). Now, permissibility of $\mathcal{H}$ automatically implies permissibility of $\overline{\mathcal{H}} := \{\bar{h} : h \in \mathcal{H}\}$, and $\mathcal{H}^* = \inf_{\overline{\mathcal{H}}}$, so $\mathcal{H}^*$ is measurable by (4).

Now let $\mathbb{H}$ be indexed by $f : Z \times T \times S \to [0,1]$ in the appropriate way. To prove (6), define $g : \mathcal{P} \times T \times S \to [0,1]$ by $g(P, t, s) := \int_Z f(z, t, s)\, dP(z)$. By Fubini's theorem $g$ is a $\mathcal{B}(\mathcal{P}) \otimes \mathcal{B}(T) \otimes \mathcal{B}(S)$-measurable function. Let $G : \mathcal{P} \times S \to [0,1]$ be defined by $G(P, s) := \inf_{t \in T} g(P, t, s)$. $G$ indexes $\mathbb{H}^*$ in the appropriate way for $\mathbb{H}^*$ to be permissible, provided it can be shown that $G$ is $\mathcal{B}(\mathcal{P}) \otimes \mathcal{B}(S)$-measurable. This is where analyticity becomes important. Let $g_\alpha := \{(P, t, s) : g(P, t, s) < \alpha\}$. By property (b) of analytic sets, $\mathcal{A}(\mathcal{P} \times T \times S)$ contains $g_\alpha$. The set $G_\alpha := \{(P, s) : G(P, s) < \alpha\}$ is the projection of $g_\alpha$ onto $\mathcal{P} \times S$ (an infimum over $t$ is below $\alpha$ exactly when some $t$ gives a value below $\alpha$), which by property (c) is also analytic. As $(\mathcal{P}, \mathcal{B}(\mathcal{P}), Q)$ is assumed complete, $G_\alpha$ is measurable, by property (a). Thus $G$ is a measurable function and the permissibility of $\mathbb{H}^*$ follows. □